WARNING

Some of the circuitry inside this system operates at hazardous energy and
electric shock voltage levels. To avoid the risk of personal injury due to
contact with an energy hazard, or risk of electric shock, do not enter any
portion of this system unless it is intended to be accessible without the use
of a tool. The areas that are considered accessible are the outer enclosure
and the area just inside the front door when all of the front panels are
installed, and the front of the diagnostic station. There are no user
serviceable areas inside the system. Refer any need for such access only to
technical personnel that have been qualified by Intel Corporation.

CAUTION 

This equipment has been tested and found to comply with the limits for a 
Class A digital device, pursuant to Part 15 of the FCC Rules. These limits 
are designed to provide reasonable protection against harmful interference
when the equipment is operated in a commercial environment. This
equipment generates, uses, and can radiate radio frequency energy and, 
if not installed and used in accordance with the instruction manual, may 
cause harmful interference to radio communications. Operation of this 
equipment in a residential area is likely to cause harmful interference in 
which case the user will be required to correct the interference at his own 
expense. 


LIMITED RIGHTS 

The information contained in this document is copyrighted by and shall
remain the property of Intel Corporation. Use, duplication or disclosure by the
U.S. Government is subject to Limited Rights as set forth in subparagraphs
(a)(15) of the Rights in Technical Data and Computer Software clause at
252.227-7013. Intel Corporation, 2200 Mission College Boulevard, Santa
Clara, CA 95052. For all Federal use or contracts other than DoD Limited
Rights under FAR 52.227-14, ALT. III shall apply. Unpublished—rights
reserved under the copyright laws of the United States.
The handler function you define must be written in C and must have four arguments of type long.
These arguments are passed the following values when the function is called:

1. Type of the message (as returned by infotype()).

2. Length of the message in bytes (as returned by infocount()).

3. Node number of the process that sent the message (as returned by infonode()).

4. Process type of the process that sent the message (as returned by infoptype()).

For example, here’s a C code fragment that attaches the functions funct0(), funct1(), and funct2() to
message types 0, 1, and 2, respectively. The message types that have handlers are referred to as
handled types.

#include <nx.h>

char buf0[100], buf1[100], buf2[100];
void funct0(), funct1(), funct2();

hrecv(0, buf0, sizeof(buf0), funct0);
hrecv(1, buf1, sizeof(buf1), funct1);
hrecv(2, buf2, sizeof(buf2), funct2);

/* Now perform other work. No blocking happens. */


The declaration of funct1() looks like this (the other functions are similar):

void funct1(long type, long count, long node, long ptype)
{ 


} 

When a message of type 1 arrives, the message is stored in the buffer specified in the hrecv() call
(in this case, buf1), then funct1() is called with the type and length of the message and the node
number and process type of the sender as arguments. funct1() and the main program then run
concurrently until funct1() returns. (In previous releases of Paragon OSF/1, the main program was
interrupted and did not run at all until funct1() returned.)
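The calling convention described above can be tried out without a Paragon system. The following is only a minimal sketch that builds with any C compiler: attach_handler() and deliver() are hypothetical stand-ins for hrecv() and for the arrival of a message, and they are not part of nx.h. Unlike the real system, this mock calls the handler synchronously instead of running it concurrently with the main program.

```c
#include <stdio.h>

/* Handler signature required by the text: four arguments of type long. */
typedef void (*handler_fn)(long type, long count, long node, long ptype);

#define MAX_TYPES 3                     /* hypothetical: handled types 0, 1, and 2 only */

static handler_fn handlers[MAX_TYPES];  /* hypothetical stand-in for the system's table */
static long last_type = -1;             /* values recorded by the sample handler */
static long last_count = -1;

/* Hypothetical stand-in for hrecv(): remember which function handles a type. */
static int attach_handler(long type, handler_fn fn)
{
    if (type < 0 || type >= MAX_TYPES || fn == NULL)
        return -1;
    handlers[type] = fn;
    return 0;
}

/* Hypothetical stand-in for message arrival: call the attached handler
   with the type, length, sender node, and sender process type. */
static int deliver(long type, long count, long node, long ptype)
{
    if (type < 0 || type >= MAX_TYPES || handlers[type] == NULL)
        return -1;                      /* not a handled type */
    handlers[type](type, count, node, ptype);
    return 0;
}

/* Sample handler with the four-long signature. */
static void funct1(long type, long count, long node, long ptype)
{
    last_type = type;
    last_count = count;
    printf("type %ld, %ld bytes, from node %ld, ptype %ld\n",
           type, count, node, ptype);
}
```

For example, calling attach_handler(1, funct1) and then deliver(1, 100, 3, 0) runs funct1() with those four values, while deliver() for a type with no attached handler returns -1.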


CAUTION 

The handler runs in the same memory space as the main program 
(but they have separate stacks). 
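Because the handler and the main program share memory and run concurrently, data that both of them update needs protection. The following is a minimal sketch of one way to do that, assuming a POSIX threads environment as a stand-in for the handler mechanism; handler_thread() and run_concurrently() are hypothetical names, and the way Paragon OSF/1 itself starts a handler is not shown here.

```c
#include <stddef.h>
#include <pthread.h>

/* Shared by the main program and the handler: same memory space. */
static long messages_seen = 0;
static pthread_mutex_t seen_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical handler body, run here on its own thread (its own stack). */
static void *handler_thread(void *arg)
{
    long n = *(long *)arg;              /* number of simulated messages */
    long i;

    for (i = 0; i < n; i++) {
        pthread_mutex_lock(&seen_lock);
        messages_seen++;                /* update shared state under the lock */
        pthread_mutex_unlock(&seen_lock);
    }
    return NULL;
}

/* Run the "handler" concurrently with the caller, which updates the
   same counter. Returns the final count, or -1 if the thread could
   not be created. */
static long run_concurrently(long n)
{
    pthread_t tid;
    long i;

    messages_seen = 0;
    if (pthread_create(&tid, NULL, handler_thread, &n) != 0)
        return -1;
    for (i = 0; i < n; i++) {
        pthread_mutex_lock(&seen_lock);
        messages_seen++;
        pthread_mutex_unlock(&seen_lock);
    }
    pthread_join(tid, NULL);            /* wait until the handler returns */
    return messages_seen;               /* 2 * n when every update was protected */
}
```

Without the mutex, the two increments could interleave and some updates could be lost. Local variables such as i need no protection, because each thread has its own stack.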
Preface


This manual tells how to use the Paragon™ OSF/1 operating system on an Intel supercomputer. 

This manual assumes that you are an application programmer proficient in the C or Fortran language 
and the UNIX operating system. The manual provides you with enough detail to begin using your 
system. 

NOTE 

Programming examples in this manual are intended to 
demonstrate the use of Paragon OSF/1 system calls, not as 
examples of good programming practice. 

For example, in some cases, the return values of functions are not checked for error conditions. This 
is not recommended, but the error checks have been omitted in order to make the example shorter 
and easier to read. 


Organization 

Chapter 1       Provides an overview of the Paragon OSF/1 software and Intel
                supercomputer hardware.

Chapter 2       Describes the Paragon OSF/1 commands that you can enter at the shell
                prompt and the Paragon OSF/1 cross-development commands that run on
                supported workstations.

Chapter 3       Describes the message-passing system calls available to programs in Paragon
                OSF/1.

Chapter 4       Describes the other general-purpose system calls available in Paragon OSF/1.

Chapter 5       Describes the parallel I/O calls you can use for parallel access to the Intel
                supercomputer’s file systems.

Chapter 6       Describes the pthreads package, which you can use to create and control
                multiple threads (also called “lightweight processes”) within your programs.

Chapter 7       Tells how to prepare an application for the Paragon OSF/1 operating system.
                The steps described are applicable to applications that are written for a
                parallel computer and applications that are ported from a sequential
                computer. This chapter discusses three examples: an integration, a
                matrix*vector multiplication, and the N-Queens problem.

Chapter 8       Presents some techniques you can use to improve the performance of your
                parallel applications.

Appendix A      Summarizes the commands and system calls of Paragon OSF/1. The
                complete syntax of each command and call is provided, along with a brief
                description of each.

Appendix B      Describes the level of support offered by Paragon OSF/1 for the commands
                and system calls of the iPSC® system.


Notational Conventions 

This manual uses the following notational conventions: 

Bold            Identifies command names and switches, system call names, reserved words,
                and other items that must be used exactly as shown.

Italic          Identifies variables, filenames, directories, partitions, user names, and writer
                annotations in examples. Italic type style is also occasionally used to
                emphasize a word or phrase.

Plain-Monospace
                Identifies computer output (prompts and messages), examples, and values of
                variables.

Bold-Italic-Monospace
                Identifies user input (what you enter in response to some prompt).

Bold-Monospace
                Identifies the names of keyboard keys (which are also enclosed in angle
                brackets). A dash indicates that the key preceding the dash is to be held down
                while the key following the dash is pressed. For example:

                <Break> <s> <Ctrl-Alt-Del>

[ ]             (Brackets) Surround optional items.

...             (Ellipsis dots) Indicate that the preceding item may be repeated.

|               (Bar) Separates two or more items of which you may select only one.

{ }             (Braces) Surround two or more items of which you must select one.

Applicable Documents 

For more information, refer to the following manuals. See the Paragon™ System Technical
Documentation Guide for information on the complete Paragon document set and ordering 
information. 

Paragon™ Manuals 

• Paragon™ Commands Reference Manual 

• Paragon™ Network Queueing System Manual

• Paragon™ C Compiler User's Guide 

• Paragon™ Fortran Compiler User's Guide 

• Paragon™ C System Calls Reference Manual 

• Paragon™ Fortran System Calls Reference Manual 

• Paragon™ Application Tools User's Guide 

• Paragon™ Interactive Parallel Debugger Reference Manual 

• Paragon™ System Administrator's Guide 

For information about limitations and workarounds, see the Paragon™ System Software Release
Notes for the Paragon™ XP/S System. Release notes are also located in the directory
/vol/share/release_notes on your Paragon system.
Other Manuals

• OSF/1 User’s Guide

• OSF/1 Programmer’s Reference

• OSF/1 Command Reference

• Effective Fortran 77 - Michael Metcalf

• C: A Reference Manual - Harbison and Steele

• The C Programming Language - Kernighan and Ritchie

• CLASSPACK Basic Math Library User’s Guide - Kuck & Associates

• CLASSPACK Basic Math Library/C User’s Guide - Kuck & Associates
Comments and Assistance


Intel Supercomputer Systems Division is eager to hear of your experiences with our products. Please 
call us if you need assistance, have questions, or otherwise want to comment on your Paragon 
system. 


U.S.A./Canada Intel Corporation 
phone: 800-421-2823 
Internet: support@ssd.intel.com


Intel Corporation Italia s.p.a. 

Milanofiori Palazzo 
20090 Assago 
Milano 
Italy 

1678 77203 (toll free) 

France Intel Corporation 

1 Rue Edison-BP303 

78054 St. Quentin-en-Yvelines Cedex 

France 

0590 8602 (toll free) 

Intel Japan K.K. 

Supercomputer Systems Division 

5-6 Tokodai, Tsukuba City 
Ibaraki-Ken 300-26 
Japan 

0298-47-8904 


United Kingdom Intel Corporation (UK) Ltd. 

Supercomputer System Division 

Pipers Way 

Swindon SN3 1RJ

England 

0800 212665 (toll free) 

(44) 793 491056 (answered in French)

(44) 793 431062 (answered in Italian)

(44) 793 480874 (answered in German)

(44) 793 495108 (answered in English)


Germany Intel Semiconductor GmbH 

Dornacher Strasse 1

8016 Feldkirchen bei Muenchen

Germany 

0130 813741 (toll free) 


World Headquarters 
Intel Corporation 
Supercomputer Systems Division 
15201 N.W. Greenbrier Parkway 
Beaverton, Oregon 97006 
U.S.A. 

(503) 629-7600 (Monday through Friday, 8 AM to 5 PM Pacific Time) 
Fax: (503) 629-9147 


If you have comments about our manuals, please fill out and mail the enclosed Comment Card. You 
can also send your comments electronically to the following address: 

techpubs@ssd.intel.com




Paragon™ User’s Guide 




Table of Contents 



Chapter 1
Introduction

Introduction
System Hardware
Nodes
Node Interconnect Network
I/O Interfaces
Front Panel LEDs (Paragon™ XP/S Only)
System Software
Paragon™ OSF/1 Operating System
User Model
Programming Model
Cross-Development Facility




Chapter 2
Using Paragon™ OSF/1 Commands

Introduction
Terminology
Using Paragon™ OSF/1 Commands on the Intel Supercomputer
Using Paragon™ OSF/1 Commands on Workstations
A Quick Example
Information You Need
Compiling, Linking, and Executing an Application
Compiling and Linking Applications
Configuring Your Environment for Cross-Development
Tips for Compiling and Linking
Using Other Switches
Including nx.h or fnx.h
Specifying Include File and Library Pathnames
Preprocessing a Fortran Program
Order of Switches
Running Applications
I/O Redirection
Controlling the Application’s Execution Characteristics
Using the Default Partition
Setting Your Default Partition
Determining the Current Default Partition
Specifying Application Size
Specifying a Rectangle of Nodes
Specifying a Particular Rectangle of Nodes
Using the Default Size
Specifying Application Priority
Specifying Process Type
Running a Program on a Subset of the Nodes




Running Applications Consisting of Multiple Programs
Running an Application in a Particular Partition
Managing Running Applications
Managing Partitions
Special Partitions
The Root Partition
The Service Partition
The Compute Partition
Partition Pathnames
Partition Characteristics
Parent Partition
Partition Name
Nodes Allocated to the Partition
Node Numbers Within a Partition
Unusable Nodes
Owner, Group, and Protection Modes
Scheduling Characteristics
Standard Scheduling
Gang Scheduling
Space Sharing
Summary of Scheduling Types
A Scheduling Example
Making Partitions
Specifying the Nodes Allocated to the Partition
Specifying Protection Modes
Specifying Scheduling Characteristics
Removing Partitions
Removing Partitions Containing Running Applications
Removing Partitions Containing Subpartitions




Showing Partition Characteristics
Showing Free Nodes
Listing Subpartitions
Recursively Listing Subpartitions
Listing the Applications in a Partition
Applications in Subpartitions
Recursively Listing Applications in Subpartitions
Changing Partition Characteristics

Chapter 3
Using Paragon™ OSF/1 Message-Passing System Calls

Introduction
Process Characteristics
Node Numbers
Process Types
Message Characteristics
Message Length
Message Type
Message ID
Message Order
Names of Send and Receive Calls
Synchronous Send and Receive
Synchronous Send to Multiple Nodes
Asynchronous Send and Receive
Releasing Message IDs
Merging Message IDs
Probing for Pending Messages




Getting Information About Pending or Received Messages
Message Passing with Fortran Commons
Treating a Message as an Interrupt
Passing Information to the Handler
Preventing Interrupts
Extended Receive and Probe
Global Operations

Chapter 4
Using Other Paragon™ OSF/1 System Calls

Introduction
Managing Applications
Controlling Application Execution with System Calls
Creating an Application with nx_initve()
Creating a Rectangular Application with nx_initve_rect()
Setting an Application’s Priority with nx_pri()
Copying a Process onto the Nodes with nx_nfork()
Loading a Program onto the Nodes with nx_load()
Loading a Program onto the Nodes with nx_loadve()
Waiting for Application Processes with nx_waitall()
Using PIDs
Getting Information About Applications
Finding an Application’s Shape with nx_app_rect()
Listing an Application’s Nodes with nx_app_nodes()
Listing the Applications in a Partition with nx_pspart()




The Controlling Process
Process Groups
Process Groups in Paragon™ OSF/1
Killing Application Processes
An Example Controlling Process
Message Passing Between Controlling Process and Application Processes
Managing Partitions
Making Partitions
Removing Partitions
Getting Information About Partitions
Determining a Partition’s Attributes with nx_part_attr()
Determining a Partition’s Nodes with nx_part_nodes()
Changing Partition Characteristics
Listing Unusable Nodes
Handling Errors
Underscore Calls
Core Dumps
Controlling Floating-Point Behavior
Detecting Not-a-Number
Controlling Floating-Point Behavior
Rounding Mode
Exception Mask and Sticky Flags
Fortran Exception Mask Values
Miscellaneous Calls
Temporarily Releasing Control of the Processor
Timing Execution
iPSC® and Touchstone DELTA Compatibility Calls




Chapter 5
Using Parallel File I/O

Introduction
Disks and File Systems
PFS File Systems and PFS Files
PFS Filenames and Pathnames
PFS Limitations
Using PFS Commands
Displaying File System Attributes
Increasing the Size of a File
Using Parallel I/O Calls
Opening Files in Parallel
Using gopen() in C
Using gopen() in Fortran
Opening Files with Standard Operations
Special Considerations for Fortran
Formatted Versus Unformatted I/O
New Files
Unnamed Files
Using I/O Modes
M_UNIX (Mode 0)
M_LOG (Mode 1)
M_SYNC (Mode 2)
M_RECORD (Mode 3)
M_GLOBAL (Mode 4)
An I/O Mode Example
Fortran Example
C Example
Compiling and Running the Example



M_UNIX Output
M_LOG Output
M_SYNC Output
M_RECORD Output
M_GLOBAL Output
Reading and Writing Files in Parallel
Synchronous File I/O
Asynchronous File I/O
Closing Files in Parallel
Detecting End-of-File and Moving the File Pointer
Flushing Fortran Buffered I/O
Using “###” Filenames
Increasing the Size of a File
Using Extended Files
OSF/1 Calls that Do Not Support Extended Files
OSF/1 Commands that Do Not Support Extended Files
Manipulating Extended Files
Performing Extended Arithmetic
Getting Information About PFS File Systems
Getting Information About All Mounted PFS File Systems
Getting PFS Information About a Single File System
Controlling Tape Devices
Naming Tape Devices
Performing Operations on Tape Devices
Getting Status of Tape Devices
Synchronization Summary




Chapter 6
Using Pthreads

Introduction
The Pthreads Package
What’s In This Chapter
Limitations of Pthreads
Recommended Safe Operating Environment
Compiling and Linking a Pthread Application
Using Reentrant C Library Calls
Using Pthreads Library Calls
Pthreads Library Data Types and Symbols
The Main Thread
Managing Pthread Execution
Managing Pthread Attributes
Managing Mutexes
Managing Mutex Attributes
An Example Pthreads Program
Using Condition Variables to Synchronize Pthreads
Managing Condition Attributes
Examples of Condition Variables
Canceling Pthreads
Cancelability States
Cancellation Examples
Pthreads Cleanup Routines
Managing Pthread Keys
Executing a Routine Once
Managing Signals
Interfacing with Non-Thread-Safe Code
Message Passing and Pthreads Library Calls



File I/O and Pthreads Library Calls
nx_nfork() and nx_initve() and Pthreads Library Calls
Signals and Pthreads Library Calls
Signal Types
Signals are a Per-Process Resource
Dealing with Signals
Handling Errors
errno Confusion
perror() and nx_perror()
Calling exit()
Use of Underscore Versions of Paragon System Calls
Catch Signals Causing Core Dump by Default
When One Pthread Hangs

Chapter 7
Designing a Parallel Application

Introduction
The Paragon™ OSF/1 Programming Model
Parallel Programming Techniques
Separating the User Interface from the Computation
Balancing the Load
Domain Decomposition
Control Decomposition
Making the Program Independent of the Number of Nodes
Designing Your Communication Strategy
Using Global Operations
Using Alternate Node Topologies




Example Application: Calculating pi
Example Application: Matrix*Vector Multiplication
Example Application: The N-Queens Problem

Chapter 8
Improving Performance

Introduction
Single Node Performance
Use Profiling Tools
Avoid Repeated Use of System Calls
Avoid Virtual Memory Paging
Use Compiler Optimizations
Increase Problem Size
Access Contiguous Memory Locations
Use Caching Wisely
Use Optimized Libraries
Use Assembly Language Subroutines
Avoid Error Checking (C Language Only)
Multi-Node Performance
Use Dynamic Memory Allocation for Large Arrays
Avoid Serializing Calls
Use ParaGraph
Maintain Data Locality
Overlap Computation and Communication
Avoid Message Buffering
Align Application Buffers




Understand Message-Passing Flow Control
Overview of Message-Passing Flow Control
Process Locking
Packetization
System Message Buffers
Message-Passing Configuration Switches
Summary of the Message-Passing Configuration Switches
Default, Maximum, and Minimum Values
Dependencies and Rounding
Recommendations
I/O Performance
Use PFS File Systems
Use gopen() Instead of open()
Use Parallel I/O Calls
Use Asynchronous Calls
Use the Appropriate I/O Mode
Align I/O Buffers with Virtual Memory Pages
Read or Write Whole File System Blocks
Make Good Use of File Striping

Appendix A
Summary of Commands and System Calls

Command Summary
Compiling and Linking Applications
Running Applications
Managing Partitions
Parallel File System Commands
Miscellaneous Commands




C System Call Summary
Process Characteristics
Synchronous Send and Receive
Asynchronous Send and Receive
Probing for Pending Messages
Getting Information About Pending or Received Messages
Treating a Message as an Interrupt
Extended Receive and Probe
Global Operations
Controlling Application Execution
Getting Information About Applications
Partition Management
Finding Unusable Nodes
Handling Errors
Floating-Point Control
Miscellaneous Calls
iPSC® and Touchstone DELTA Compatibility
I/O Modes
Reading and Writing Files in Parallel
Detecting End-of-File and Moving the File Pointer
Increasing the Size of a File
Extended File Manipulation
Performing Extended Arithmetic
Getting Information About PFS File Systems
Managing Pthread Execution
Managing Pthread Attributes
Managing Mutexes
Using Condition Variables to Synchronize Pthreads
Canceling Pthreads
Pthreads Cleanup Routines
Managing Pthread Keys
Miscellaneous Pthread Calls




Fortran System Call Summary
Process Characteristics
Synchronous Send and Receive
Asynchronous Send and Receive
Probing for Pending Messages
Getting Information About Pending or Received Messages
Treating a Message as an Interrupt
Extended Receive and Probe
Global Operations
Controlling Application Execution
Getting Information About Applications
Partition Management
Finding Unusable Nodes
Handling Errors
Floating-Point Control
Miscellaneous Calls
iPSC® and Touchstone DELTA Compatibility
I/O Modes
Reading and Writing Files in Parallel
Detecting End-of-File and Moving the File Pointer
Flushing Fortran Buffered I/O
Increasing the Size of a File
Extended File Manipulation
Performing Extended Arithmetic




Appendix B
iPSC® System Compatibility

Introduction
General Compatibility Issues
New Features
Compilers
Commands
Cube Control Commands
CFS Commands
System Administration Commands
Remote Host Commands
Miscellaneous Commands
System Calls
Include Files
Host Calls
Byte-Swapping Calls
Floating-Point Control Calls
CFS Calls
Miscellaneous Calls
Summary




List of Illustrations

Figure 1-1. Front Panel LEDs (Paragon™ XP/S Only)
Figure 1-2. Node Activity LEDs
Figure 2-1. The Root Partition of a 32-Node System
Figure 2-2. Node Numbers in Contiguous and Noncontiguous Partitions
Figure 2-3. Node Numbers in Overlapping Partitions
Figure 4-1. Sample Partition for nx_part_attr() and nx_part_nodes()
Figure 7-1. Using Domain Decomposition to Achieve Load Balancing
Figure 7-2. The Decomposition Used for the pi Example
Figure 7-3. The N-Queens Solution Tree for a 4 x 4 Board
Figure 8-1. Two Methods of Improving I/O Performance with M_RECORD




List of Tables

Table 2-1. Summary of Scheduling Types
Table 5-1. File Operations that Accept “###” Filenames
Table 5-2. OSF/1 Calls Not Supporting Extended Files
Table 5-3. OSF/1 Commands Not Supporting Extended Files
Table 5-4. Synchronization in Each I/O Mode
Table 5-5. File I/O Calls that Synchronize
Table 6-1. Calls in Reentrant C Library (libc_r.a)
Table 8-1. Message-Passing Configuration Switches
Table A-1. Commands for Compiling and Linking Applications
Table A-2. Commands for Running Applications
Table A-3. Commands for Managing Partitions
Table A-4. Parallel File System Commands
Table A-5. Miscellaneous Commands
Table A-6. C Calls for Process Characteristics
Table A-7. C Calls for Synchronous Send and Receive
Table A-8. C Calls for Asynchronous Send and Receive
Table A-9. C Calls for Probing for Pending Messages
Table A-10. C Calls for Getting Information About Pending or Received Messages
Table A-11. C Calls for Treating a Message as an Interrupt
Table A-12. C Calls for Extended Receive and Probe
Table A-13. C Calls for Global Operations
Table A-14. C Calls for Controlling Application Execution
Table A-15. C Calls for Getting Information About Applications
Table A-16. C Calls for Partition Management
Table A-17. C Calls for Finding Unusable Nodes
Table A-18. C Calls for Handling Errors



Table A-19. C Calls for Floating-Point Control
Table A-20. Miscellaneous C Calls
Table A-21. C Calls for iPSC® and Touchstone DELTA Compatibility
Table A-22. C Calls for I/O Modes
Table A-23. C Calls for Reading and Writing Files in Parallel
Table A-24. C Calls for Detecting End-of-File and Moving the File Pointer
Table A-25. C Calls for Increasing the Size of a File
Table A-26. C Calls for Extended File Manipulation
Table A-27. C Calls for Performing Extended Arithmetic
Table A-28. C Calls for Getting Information About PFS™ File Systems
Table A-29. C Calls for Managing Pthread Execution
Table A-30. C Calls for Managing Pthread Attributes
Table A-31. C Calls for Managing Mutexes
Table A-32. C Calls for Using Condition Variables to Synchronize Pthreads
Table A-33. C Calls for Canceling Pthreads
Table A-34. C Calls for Pthreads Cleanup Routines
Table A-35. C Calls for Managing Pthread Keys
Table A-36. Miscellaneous Pthread Calls
Table A-37. Fortran Calls for Process Characteristics
Table A-38. Fortran Calls for Synchronous Send and Receive
Table A-39. Fortran Calls for Asynchronous Send and Receive
Table A-40. Fortran Calls for Probing for Pending Messages
Table A-41. Fortran Calls for Getting Information About Pending or Received Messages
Table A-42. Fortran Calls for Treating a Message as an Interrupt
Table A-43. Fortran Calls for Extended Receive and Probe
Table A-44. Fortran Calls for Global Operations
Table A-45. Fortran Calls for Controlling Application Execution
Table A-46. Fortran Calls for Getting Information About Applications
Table A-47. Fortran Calls for Partition Management
Table A-48. Fortran Calls for Finding Unusable Nodes



Table A-49. Fortran Calls for Handling Errors
Table A-50. Fortran Calls for Floating-Point Control
Table A-51. Miscellaneous Fortran Calls
Table A-52. Fortran Calls for iPSC® and Touchstone DELTA Compatibility
Table A-53. Fortran Calls for I/O Modes
Table A-54. Fortran Calls for Reading and Writing Files in Parallel
Table A-55. Fortran Calls for Detecting End-of-File and Moving the File Pointer
Table A-56. Fortran Calls for Flushing Buffered I/O
Table A-57. Fortran Calls for Increasing the Size of a File
Table A-58. Fortran Calls for Extended File Manipulation
Table A-59. Fortran Calls for Performing Extended Arithmetic
Table B-1. Unsupported iPSC® System Byte-Swapping Calls
Table B-2. Summary of Unsupported iPSC® System Commands
Table B-3. Summary of Unsupported iPSC® System Calls







Introduction 

This chapter introduces the Paragon™ OSF/1 operating system and the hardware it runs on. 

In an Intel supercomputer, a large number of processors called nodes work concurrently on the parts 
of a problem. Each node can run multiple processes, and each process can have multiple threads. 
The processes and threads on each node time-share the node’s processor, using the standard OSF/1 
scheduling mechanisms. Each process can be a stand-alone program (such as a shell, compiler, or 
editor), or can be part of a parallel application. 

A parallel application consists of a group of closely related processes that work together on a single 
problem. They synchronize their actions and share information by passing messages, which are 
created and controlled by special Paragon OSF/1 system calls. 

The processes in an application can also share disk files; Paragon OSF/1 parallel I/O calls ensure 
that access to these files is efficient and properly synchronized. 


System Hardware 

The Paragon OSF/1 operating system runs on several models of Intel supercomputers. These 
systems all have a large number of nodes connected by a high-speed node interconnect network, and 
a number of I/O interfaces to communicate with the outside world. 


Nodes 

Each node is essentially a separate computer, with one or more i860® processors and 16M bytes or 
more of memory. Nodes can run distinct programs and have distinct memory spaces. They can team 
up to work on the same problem and exchange data by passing messages. An Intel supercomputer 
can have up to 2000 nodes. Each node can run more than one process at the same time; these 
processes can belong to the same or different applications. 

The system administrator can choose to dedicate some nodes to interactive processes, such as shells 
and editors, and other nodes to compute-intensive applications. The nodes used for interactive 
processes are called service nodes, and the nodes used for compute-intensive applications are called 
compute nodes. However, there are no physical differences between these two types of nodes. 


Node Interconnect Network 

The nodes are connected by a high-speed node interconnect network. Each node interfaces to this 
network through special hardware that monitors the network and extracts only those messages 
addressed to its attached node. Messages addressed to other nodes are passed on without interrupting 
the node processor. For most applications, you can think of each node as being fully connected to all 
the other nodes. 


I/O Interfaces 

Some nodes are equipped with a SCSI interface, Ethernet interface, or other I/O connection. These 
nodes manage the system’s disk and tape drives, network connections, and other I/O facilities. 
Nodes with I/O interfaces communicate with the other nodes over the node interconnect network. 
However, this access is transparent: processes on nodes without I/O hardware access the I/O 
facilities using standard OSF/1 system calls, just as though they were directly connected. Nodes with 
I/O interfaces are otherwise identical to nodes without I/O interfaces, and can run user processes. 


Front Panel LEDs (Paragon™ XP/S Only) 

On the Paragon XP/S system, each cabinet has a number of Light-Emitting Diodes (LEDs) on its 
front panel that inform you of the status of the system, the nodes, and messages between nodes. The 
front panel LEDs are shown in Figure 1-1. 


[Figure: upper left corner of an LED panel, with callouts for message going left (yellow), 
message going down (green), node fault (red), and node activity (green).] 

Figure 1-1. Front Panel LEDs (Paragon™ XP/S Only) 


Each cabinet has four LED panels, each of which shows the status of 16 nodes in a 4 by 4 grid. Figure 
1-1 shows the upper left corner of one LED panel. The meanings of the LEDs are as follows: 


• The round green LED in the upper left corner of the top LED panel in each cabinet indicates 
that power has been supplied to the cabinet. (The corresponding LEDs in the other three panels 
never illuminate.) 


• The round red LED just below the green power LED indicates a fault in the cabinet’s power 
subsystem. If a fault is detected by the cabinet’s self-tests, this LED illuminates. (The 
corresponding LEDs in the other three panels never illuminate.) 

• The square groups of horizontal green LED bars show the amount of computational activity on 
the nodes. Each group represents one node. The more active a node is, the more green LEDs are 
illuminated, in a bar graph moving out from the center. Figure 1-2 shows the six possible ways 
these LEDs can be illuminated, showing activity levels from 0% to 100%. 

[Figure: the six node activity LED patterns, labeled 0%, 20%, 40%, 60%, 80%, and 100%.] 

Figure 1-2. Node Activity LEDs 

• The arrow-shaped yellow and green LED bars indicate messages. When a message is passed 
from one node to another, all the arrow LEDs along its path illuminate. (Messages always travel 
first in the X direction (horizontally), then in the Y direction (vertically). Messages never 
change direction more than once.) Yellow arrows show messages going up or to the left; green 
arrows show messages going down or to the right. When the arrows are illuminated, a light 
pattern moves along the arrow to show the direction of motion. 

• The round red LED associated with each node indicates a hardware fault on the node. If a fault 
is detected by the node’s self-tests, the red LED illuminates. 
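
The routing rule described for the message LEDs (travel in the X direction first, then in the Y 
direction, with at most one change of direction) can be sketched in C. This is an illustrative 
model only, not the interconnect's actual implementation: the struct coord type, the xy_route() 
name, and the use of integer (x, y) mesh coordinates are all assumptions made for this example. 

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mesh position; the field names are illustrative only. */
struct coord { int x; int y; };

/*
 * Fill path[] with every mesh position a message visits on its way from
 * src to dst, moving along X first and then along Y (at most one turn).
 * Returns the number of positions written, or 0 if path[] is too small.
 */
size_t xy_route(struct coord src, struct coord dst,
                struct coord *path, size_t max_len)
{
    struct coord cur = src;
    int dx = (dst.x > src.x) - (dst.x < src.x);   /* -1, 0, or +1 */
    int dy = (dst.y > src.y) - (dst.y < src.y);   /* -1, 0, or +1 */
    size_t needed = (size_t)abs(dst.x - src.x)
                  + (size_t)abs(dst.y - src.y) + 1;
    size_t n = 0;

    assert(path != NULL);
    if (needed > max_len)
        return 0;

    path[n++] = cur;
    while (cur.x != dst.x) {        /* horizontal leg */
        cur.x += dx;
        path[n++] = cur;
    }
    while (cur.y != dst.y) {        /* vertical leg */
        cur.y += dy;
        path[n++] = cur;
    }
    return n;
}
```

For example, a message from (0, 0) to (2, 1) visits (0, 0), (1, 0), (2, 0), and (2, 1): two steps 
in X, then one step in Y. 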


System Software 

The nodes run the Paragon OSF/1 operating system, based on the OSF/1 operating system from the 
Open Software Foundation. The same operating system runs on every node. OSF/1 is a version of 
the UNIX operating system that supports most industry standards; Paragon OSF/1 is an extended 
version of OSF/1 with enhancements to support parallel processing. 

The Intel supercomputer also comes with a cross-development facility, which you can use to compile 
and link Paragon OSF/1 programs on supported workstations. 


Paragon™ OSF/1 Operating System 

Paragon OSF/1 provides all the standard features of OSF/1, with extensions to provide a single 
system image across multiple nodes. This single system image makes all the nodes appear to be one 
large system. For example, all the nodes share a single file system, all the nodes have equal access 
to the system’s I/O devices, and process identifiers (PIDs) are unique throughout the system. A 
process on one node can pipe its output to a process on another node, and the command kill pid on 
any node kills the specified process, no matter which node the process is running on. 

The single system image does not combine all the nodes’ memory into a single address space. 
Rather, each process has its own address space. The physical memory available to each process is 
limited to the memory of the node on which it is running. However, because OSF/1 provides virtual 
memory, a process’s address space can be up to 2G bytes in size; memory pages that do not fit in 
physical memory are paged to disk. As in most multi-user systems, the address spaces of the 
different processes on the system are completely independent, unless two or more processes make 
special shared virtual memory calls to explicitly share part of their memory. 

In addition to the standard facilities of OSF/1, the Paragon OSF/1 operating system provides 
message-passing capability, Parallel File System access, and various other utilities to programs 
running on the Intel supercomputer. With Paragon OSF/1 calls, your programs can perform the 
following functions: 

• Exchange messages with processes running on other nodes (or the same node). 

• Read and write files on the Intel supercomputer’s Parallel File System. 

• Perform 64-bit integer arithmetic. 

• Find out information about the computing environment. 

• Perform global operations. 

• Create and control parallel applications and partitions. 


User Model 

The Paragon OSF/1 operating system is a complete implementation of OSF/1, and provides a full 
range of services, commands, and system calls. It has its own file system, shells, compilers, editors, 
network connections, and all the other features needed in a stand-alone computer system. It also 
supports NFS, the Network File System, so it can share data with other systems on your network. 
You can edit and compile programs, send and receive mail, read online manual pages, and do all 
your other daily work on the Intel supercomputer. 

You access the Intel supercomputer by logging into a separate computer (typically your UNIX 
workstation) and then connecting to the Intel supercomputer over a local-area network, using a 
command such as rlogin or telnet. The Intel supercomputer does not have any dedicated hardware 
terminals. 

You compile and link your application with the self-hosted Paragon OSF/1 compilers and linker. 
You then execute your application on the nodes of the Intel supercomputer simply by typing the 
application’s name on the shell command line. Command-line switches, or arguments to system 
calls in the program, determine the number of nodes on which the application executes. 




When you run an application, it runs in a partition. A partition is a group of nodes with an associated 
set of parameters that controls some of the run-time characteristics of the applications within it. You 
can use commands or system calls to create, modify, and remove partitions. However, the operations 
you are allowed to perform on your system’s partitions may be restricted by the policies of your site. 

The Paragon OSF/1 operating system also provides a suite of program development tools, such as a 
debugger, profiler, and parallel performance analysis tools. These tools are described in the 
Paragon™ Application Tools User’s Guide. 


Programming Model 

The most common programming model used with Paragon OSF/1 is the “single program, multiple 
data” (SPMD) model. In this model, the same program runs on each node in the application, but each 
node works on only part of the data. 

• For some problems, called “perfectly parallel” problems, each node can do its work without 
access to data held by other nodes. In this case, each node operates completely independently. 

• For other types of problems, each node needs data from other nodes to do its work. In this case, 
the nodes can share data by passing messages. Messages can also be used to synchronize node 
operations. 
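
As a rough illustration of the SPMD idea, the sketch below shows one common way for every node to 
run the same code yet work on only its own part of the data: each node computes the block of item 
indices it owns from its node number and the total number of nodes. The names struct range and 
block_range() are hypothetical. On the Intel supercomputer the node number would typically come 
from mynode(); the node count is assumed to come from a similar call such as numnodes(). Both are 
plain parameters here so that the sketch compiles on any system. 

```c
#include <assert.h>

/* Half-open range [first, last) of item indices owned by one node. */
struct range { long first; long last; };

/*
 * Split n_items as evenly as possible among n_nodes. The first
 * (n_items % n_nodes) nodes each receive one extra item.
 * node and n_nodes are ordinary parameters (an assumption of this sketch);
 * in a real application they would come from the system's node-number
 * and node-count calls.
 */
struct range block_range(long n_items, long node, long n_nodes)
{
    struct range r;
    long base, extra;

    assert(n_items >= 0 && n_nodes > 0 && node >= 0 && node < n_nodes);

    base  = n_items / n_nodes;      /* items every node gets */
    extra = n_items % n_nodes;      /* leftover items, one per low node */

    r.first = node * base + (node < extra ? node : extra);
    r.last  = r.first + base + (node < extra ? 1 : 0);
    return r;
}
```

With 10 items on 4 nodes, nodes 0 and 1 own three items each ([0, 3) and [3, 6)), and nodes 2 and 
3 own two items each ([6, 8) and [8, 10)). 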

Because each node is an independent computer, you can also use other programming models. One 
example is the “manager-worker” model, in which one “manager” program starts up several 
“worker” programs on other nodes, then gathers and interprets their results. 
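
The shape of a manager-worker program can be sketched with ordinary POSIX calls. This is a 
single-machine stand-in, not Paragon OSF/1 code: it uses fork() and a pipe where a real application 
would start workers on other nodes and collect their results by passing messages. The function 
names manager_sum() and worker_sum() are hypothetical. 

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker: sum the integers in 1..n that belong to this worker's strided share. */
static long worker_sum(long n, int id, int n_workers)
{
    long i, sum = 0;

    for (i = id + 1; i <= n; i += n_workers)
        sum += i;
    return sum;
}

/*
 * Manager: start n_workers child processes, gather one partial result from
 * each over a shared pipe, and combine them. Returns the sum of 1..n, or -1
 * on error. Each write is a single long, well under PIPE_BUF, so results
 * from different workers are never interleaved.
 */
long manager_sum(long n, int n_workers)
{
    int fds[2], w, started = 0, failed = 0;
    long total = 0;

    assert(n_workers > 0);
    if (pipe(fds) != 0)
        return -1;

    for (w = 0; w < n_workers; w++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {                     /* worker process */
            long part = worker_sum(n, w, n_workers);
            close(fds[0]);
            if (write(fds[1], &part, sizeof part) != (ssize_t)sizeof part)
                _exit(1);
            _exit(0);
        }
        started++;
    }

    close(fds[1]);                          /* the manager only reads */
    for (w = 0; w < started; w++) {
        long part;
        if (read(fds[0], &part, sizeof part) != (ssize_t)sizeof part) {
            failed = 1;
            break;
        }
        total += part;                      /* interpret/combine results */
    }
    close(fds[0]);

    while (wait(NULL) > 0)                  /* reap all workers */
        ;
    return failed ? -1 : total;
}
```

Calling manager_sum(100, 4) starts four workers and returns 5050, the sum of the integers from 1 
to 100. 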


Cross-Development Facility 

Paragon OSF/1 comes with a complete program development environment, including compilers, 
linker, libraries, and related tools. You can perform all phases of program development on the Intel 
supercomputer. In addition, the compilers, linker, and libraries for Paragon OSF/1 are also available 
on selected UNIX workstations. This cross-development facility lets you edit, compile, and link 
Paragon OSF/1 programs on your own workstation. 

Note, though, that the cross-development facility does not include a way to run a Paragon OSF/1 
executable that resides on your workstation’s disk. You must transfer your executable files to the 
Intel supercomputer for execution and debugging. You can do this by mounting your workstation’s 
file system onto the Intel supercomputer, or the Intel supercomputer’s file system onto your 
workstation, using the Network File System (NFS). You can also use commands such as rcp or ftp 
to copy the executable files to the Intel supercomputer. To execute files on the Intel supercomputer 
once they are transferred, you can use the standard rsh or rcmd command. 


This chapter tells you how to use Paragon OSF/1 commands to perform the following tasks: 

• Compiling and linking applications. 

• Running applications. 

• Managing running applications. 

• Managing partitions. 

The commands discussed in this chapter are available to all users. See the Paragon™ System 
Administrator’s Guide for information on commands that require root privilege. 

This chapter does not discuss NQS, the Network Queueing System, which is used at some sites to 
schedule application execution. See the Paragon™ Network Queueing System Manual for 
information on NQS. 


Terminology 

This chapter uses the following terms: 

• A parallel application, usually just called an application in this manual, is a group of 
cooperating processes that runs on the nodes of the Intel supercomputer. 

• A program is a file (source or executable). An application consists of one or more programs 
running on one or more nodes. The term program is also used to refer to a non-parallel program 
(an ordinary program that runs on one node). 

• A partition is a named group of nodes. When you run a parallel application, you must select a 
partition to run it in (if you don’t, it runs in your default partition). The partition places limits 
on some of the execution characteristics of the application, such as how many nodes it can use 
and how long it can use them before it is “rolled out” and another application is “rolled in.” You 
can allocate all of the nodes of the partition to the application, or just some of them. This 
allocation may or may not be exclusive, depending on the characteristics of the partition. 

All Intel supercomputers have two special partitions called the service partition and the compute 
partition. The service partition is used to run non-parallel programs such as shells and editors, 
and the compute partition is used to run parallel applications. The other partitions on your 
system, and what you can do with them, are determined by your system administrator. 


Using Paragon™ OSF/1 Commands on the Intel Supercomputer 

The Paragon OSF/1 operating system provides all of the standard commands of OSF/1, such as cat 
and ls, which work as specified by the Open Software Foundation. These commands are not 
described in this chapter; see the OSF/1 Command Reference for information on these commands. 

Paragon OSF/1 also provides several commands that are not specified by the Open Software 
Foundation, such as mkpart and rmpart. These commands are described in this chapter, and 
manual pages for these commands are provided in the Paragon™ Commands Reference Manual. 

To use any of these commands, you must first log into an Intel supercomputer. Intel supercomputers 
have no directly-attached terminals; you must first log into another system (typically a workstation 
running some variant of the UNIX operating system) and then log into the Intel supercomputer over 
the network, using a command such as rlogin or telnet. Once you have logged in, you use these 
commands in the same way as commands on any other computer running OSF/1. 


Using Paragon™ OSF/1 Commands on Workstations 

The Paragon OSF/1 operating system also comes with several commands that run on workstations 
(for example, the icc and if77 cross-compilers). These commands are described briefly in this 
chapter; complete descriptions and manual pages for these commands are provided in the Paragon™ 
C Compiler User’s Guide and Paragon™ Fortran Compiler User’s Guide. 

To use these commands, you must first log into a workstation on which these commands are 
supported, then configure your account as described under “Configuring Your Environment for 
Cross-Development” on page 2-6. Once you have done this, you can use the Paragon OSF/1 
cross-development commands in the same way as other commands on the workstation. However, if 
you compile an application on a workstation you must transfer the executable file to an Intel 
supercomputer to execute it. Depending on your local configuration, you may be able to use the 
Network File System (NFS), the rcp command, the ftp command, or some other technique to do this. 
Ask your system administrator about how files are shared between the Intel supercomputer and other 
systems on your network. 


A Quick Example 

Here is a quick example that shows you how to compile, link, and execute a simple application on 
an Intel supercomputer. 

Information You Need 

Before you begin, you will need the following information: 

• The network name of your Intel supercomputer. 

• The command to use to log into the Intel supercomputer, such as rlogin or telnet. 

• Your user name and password on the Intel supercomputer (if necessary). 

• The name of the default partition you should use to run parallel applications. 

This information should be available from your system administrator. 

Compiling, Linking, and Executing an Application 

Once you have the necessary information, the procedure to compile, link, and execute an application 
is as follows: 

1. Log into the Intel supercomputer, as instructed by your system administrator. 

2. Set the environment variable NX_DFLT_PART to the name of your default partition: 

• If you use the C shell, use the following command: 

% setenv NX_DFLT_PART partition_name 

• If you use the Bourne or Korn shell, use the following commands: 

$ NX_DFLT_PART=partition_name 
$ export NX_DFLT_PART 

3. Type in a short program: 

• If you are a Fortran programmer, type the following program into the file myapp.f: 

      program hello 
      include 'fnx.h' 

      write(*,100) mynode() 
  100 format('Hello from node', i4, '!') 

      end 

• If you are a C programmer, type the following program into the file myapp.c: 

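#include <stdio.h> 
#include <nx.h> 

int main(void) 
{ 
    /* mynode() returns the node number of the calling process. */ 
    printf("Hello from node %d!\n", mynode()); 
    return 0; 
} 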

4. Compile the program into an executable file: 

• If you are a Fortran programmer, use the following command: 

% f77 -nx -o myapp myapp.f 

• If you are a C programmer, use the following command: 

% cc -nx -o myapp myapp.c 

5. Execute the resulting file, myapp, on four nodes with the following command: 

% myapp -sz 4 
Hello from node 0! 
Hello from node 3! 
Hello from node 1! 
Hello from node 2! 

The order in which the output lines appear may vary. 

That’s all there is to it! Of course, Paragon OSF/1 provides many additional commands and switches 
you can use to control the behavior of the compiler and the resulting application. These commands 
and switches are described in the rest of this chapter. 

Command Synopsis                        Description 

cc -nx [ switches ] sourcefile...       Compile a Paragon OSF/1 application written 
                                        in C on an Intel supercomputer. 

f77 -nx [ switches ] sourcefile...      Compile a Paragon OSF/1 application written 
                                        in Fortran on an Intel supercomputer. 

icc -nx [ switches ] sourcefile...      Compile a Paragon OSF/1 application written 
                                        in C on an Intel supercomputer or 
                                        cross-development workstation. 

if77 -nx [ switches ] sourcefile...     Compile a Paragon OSF/1 application written 
                                        in Fortran on an Intel supercomputer or 
                                        cross-development workstation. 


You can compile and link applications on the Intel supercomputer itself, or on a workstation that 
supports the Paragon OSF/1 cross-development environment. On the Intel supercomputer, you can 
use the “native” commands cc and f77 or the “cross-development” commands icc and if77. On a 
workstation, you must use the cross-development commands icc and if77. The native and 
cross-development versions of each command take the same switches and work identically. 

When compiling and linking an application, you should generally use the switch -nx on the 
command line. The -nx switch has three effects: 

• If used while linking a C or Fortran program, it links in libnx.a, the library that contains all the 
system calls described in this manual. 

• If used while linking a C or Fortran program, it links in a special start-up routine that starts up 
the program on multiple nodes, as specified by standard command line switches and 
environment variables. 

• If used while compiling a C program, it defines the preprocessor symbol __NODE. The 
program being compiled can use preprocessor statements such as #ifdef to control compilation 
based on whether or not this symbol is defined. (This preprocessor symbol is not defined if -nx 
is used while compiling a Fortran program.) 
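
The third effect can be illustrated with a short sketch of conditional compilation. It assumes 
that the symbol defined by -nx is spelled __NODE (two leading underscores); the function name 
built_with_nx() is hypothetical. On any compiler that does not define the symbol, the #else 
branch is compiled instead. 

```c
/*
 * Report how this source file was compiled.
 * Returns 1 if the assumed symbol __NODE is defined (compiled with -nx),
 * or 0 if it is not (ordinary compile, or any other compiler).
 */
int built_with_nx(void)
{
#ifdef __NODE
    return 1;   /* parallel build: node-specific code can go here */
#else
    return 0;   /* ordinary build: fall back to non-parallel code */
#endif
}
```

A program can use the same #ifdef test around node-only code, so that one source file builds 
both as a parallel application and as an ordinary program. 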

For example, the following command line compiles and links the file myapp.c to create an 
executable file called myapp (on the Intel supercomputer): 

% cc -nx -o myapp myapp.c 

The following command line has the same effect (on the Intel supercomputer or a cross-development 
workstation): 

% icc -nx -o myapp myapp.c 

NOTE 

Do not use -nx if your application calls nx_initve(). 


The Paragon OSF/1 operating system provides nx_initve() and related functions to give your 
application more control over the way it starts up. They let the application perform actions for itself 
that are normally performed for it by -nx. If you link your application with -nx and it also calls 
nx_initve() itself, the application’s call to nx_initve() will fail and return -1. See “Managing 
Applications” on page 4-2 for more information on nx_initve() and related functions. 

To link an application that calls nx_initve(), use the switch -lnx instead of -nx. The -lnx switch links 
in libnx.a, but without the special start-up routine supplied by -nx. A program linked with -lnx can 
use all the calls described in this manual, but does not automatically start itself on multiple nodes. 
(Note that the -lnx switch must appear on the compiler command line after the filenames of any 
source or object files that use these calls.) Note that the preprocessor symbol __NODE is not defined 
by -lnx. 


A program that is not linked with -nx and does not call nx_initve() is not a parallel application. It 
does not recognize the command-line switches described under “Running Applications” on page 
2-11, and it always runs on one node in the service partition. (If it creates additional processes by 
calling fork(), they may run on the same node or a different node, but they will always run in the 
service partition.) 


Configuring Your Environment for Cross-Development 

Before you can use the icc and if77 commands on your workstation, you must configure your 
environment as follows: 

• The environment variable PARAGON_XDEV must be set to the pathname of the directory that 
contains the Paragon OSF/1 cross-development facility. If you don’t know this pathname, ask 
your system administrator. 

• Your execution search path (PATH or path variable) must include the directory 
$PARAGON_XDEV/paragon/bin.arch, where arch identifies the architecture of your 
workstation (such as sun4 for a Sun-4 workstation). 

You should put the definitions of these variables into your .cshrc or .login file (or the equivalent 
start-up file for your shell). For example, suppose the Paragon OSF/1 cross-development facility is 
installed in the directory /usr/local/XDEV. If you use the C shell, you would add these lines to your 
.cshrc file: 

setenv PARAGON_XDEV /usr/local/XDEV 
set path=( $path $PARAGON_XDEV/paragon/bin.`arch` ) 
setenv MANPATH "${MANPATH}:${PARAGON_XDEV}/paragon/man" 

(The curly braces in "${MANPATH}:${PARAGON_XDEV}/paragon/man" are necessary 
because a colon after a variable name is special to the C shell.) 

Once your environment is properly configured, you can use the icc or if77 command to compile and 
link applications on your workstation. For example, the following command line compiles and links 
the file myapp.f to create an executable file called myapp: 

% if77 -nx -o myapp myapp.f 

The executable file, myapp, can only be executed on the Intel supercomputer. You can do this by 
putting it in a directory that is shared between your workstation and the Intel supercomputer with the 
Network File System (NFS), or by copying it to the Intel supercomputer with the ftp or rcp 
command. If you use the ftp command, the resulting file may not have execute permission; if this 
happens, use the chmod command on the Intel supercomputer to give myapp execute permission. 


NOTE 

The Paragon OSF/1 versions of the compilers are not the same as 
their iPSC® system equivalents. 


If you develop programs for the iPSC series of supercomputers from Intel Corporation as well as for 
Paragon OSF/1, you must be sure that your execution search path (PATH or path variable) is set 
appropriately for your current target system. To compile a program for Paragon OSF/1, the variable 
PARAGON_XDEV must be set appropriately and your execution search path must include 
$PARAGON_XDEV/paragon/bin.arch; to compile a program for the iPSC system, the variable 
IPSC_XDEV must be set appropriately and your execution search path must include 
$IPSC_XDEV/i860/bin.arch instead. Be sure that your execution search path does not include both 
these directories at the same time. 


Tips for Compiling and Linking 

The following sections give you some tips for compiling and linking Paragon OSF/1 applications 
(on either the Intel supercomputer or a cross-development workstation). 


Using Other Switches 

The cc, f77, icc, and if77 commands have a variety of switches to control their operation. For a 
description of these switches and other information on these commands, see the online manual pages 
for the commands or the following printed manuals: 

cc, icc Paragon™ C Compiler User’s Guide. 

f77, if77 Paragon™ Fortran Compiler User’s Guide. 


Including nx.h or fnx.h 

As a general rule, always include the file nx.h in all Paragon OSF/1 C programs. This file contains 
definitions and declarations needed by the Paragon OSF/1 C system calls. Although a specific 
application may not need the definitions and declarations contained in nx.h, the overhead involved 
in including it in all programs is minor. Include it in your C programs as follows: 

#include <nx.h> 

For Fortran programs, the corresponding file is fnx.h. Include it in your Fortran programs as follows: 

include 'fnx.h' 


Specifying include File and Library Pathnames 

The standard include and library directories depend on whether you are using the native 
development commands or the cross-development commands: 

• The native development commands search for include files in the directory /usr/include, and 
they search for libraries in the directories /usr/ccs/lib (searched first) and /usr/lib (searched 
second). 

• The cross-development commands search for include files in the directory 
$PARAGON_XDEV/paragon/include, and they search for all libraries in the directory 
$PARAGON_XDEV/paragon/lib-coff. 

Note, though, that on the Intel supercomputer the directories /usr/paragon/XDEV/paragon/lib-coff 
and /usr/ccs/lib are identical, the directories /usr/paragon/XDEV/paragon/include and /usr/include 
are identical, and the default for $PARAGON_XDEV is /usr/paragon/XDEV, so this difference may 
not be significant. 

If you need to include a file that is not in the standard include directory or in the same directory as 
the source file, you must use the -I switch on the compiler command line to identify the nonstandard 
directory. For example, the following command line compiles and links an application that uses 
include files in the directory /usr/local/include: 

% icc -nx myapp.c -I/usr/local/include 

If you need to link to a library that is not in one of the standard library directories, then you must 
modify the command line in one of the following ways: 

• Use the -L switch to provide the pathname of the directory in which the library is located. For 
example, the following command line compiles and links an application that depends on the 
library libfft.a located in the directory /usr/local/lib: 

% icc -nx -L/usr/local/lib myapp.c -lfft 

• Specify the complete pathname of the appropriate library or libraries on the command line. For 
example, the following command line also compiles and links an application that depends on 
the library libfft.a located in the directory /usr/local/lib: 

% if77 -nx myapp.c /usr/local/lib/libfft.a 


Preprocessing a Fortran Program 

If your Fortran program is in a file whose filename ends with an uppercase “.F” (rather than the 
standard lowercase “.f”), the if77 command runs a preprocessor (like the standard C preprocessor) 
on the file. This enables you to use lines like the following in a Fortran program: 

#include <file.h> 

#define MAX 87 


Order of Switches 

Most cc, f77, icc, and if77 switches are not order-sensitive. However, order is important for the -I, 
-L, and -l switches and for listing libraries when linking. When constructing command lines, keep 
the following guidelines in mind: 

• List include directories (-I switch) in the order in which they should be searched. The list of 
include directories you specify with -I switches is collected together and used for all source files 
you specify. For example, the following command looks for include files in the directory 
myincludes, then the directory ../includes, and finally the standard include directory when 
compiling a.c, b.c, and c.c: 

% icc a.c -Imyincludes b.c -I../includes c.c 

• List libraries in the order in which they should be searched. The Paragon OSF/1 linkers are 
single-pass linkers; they cannot resolve a backward library reference (i.e., a reference to a 
library object that was defined in a library that has already been searched). Note that this means 
that if you use the -lnx switch, you should place it after any source files that need it, as follows: 

% if77 -o myapp myapp.f -lnx 

Backward references between objects (.o files), however, are not a problem, as all listed objects 
are linked unconditionally. 

• The -L switch affects only the search path of libraries that are listed after the -L switch. For 
example, the following command searches only the standard library directories for the library 
libnews.a, but searches the directory ../mylibs (as well as the standard library directories) for the 
library libgx.a: 

% icc -nx myprog.c -lnews -L../mylibs -lgx 

• If you specify more than one -L switch, the named directories are searched in reverse order (the 
directory specified by the first -L switch on the command line is searched after the directory 
specified by the second -L switch on the command line). For example: 

% icc -nx myprog.c -lnews -L../mylibs -lgx -Llocallibs -llocal 

This command searches for libraries as follows: 

It searches only the standard library directories for the library libnews.a. 

It searches the directory ../mylibs and then the standard library directories for the library 
libgx.a. 

It searches the directory locallibs, then ../mylibs, and then the standard library directories 
for the library liblocal.a. 

Note that the -L switch also affects system libraries; in fact, directories specified by -L are 
searched for system libraries before the standard library directories. 
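
The -L ordering rules above can be captured in a small model. The sketch below is not how the 
Paragon OSF/1 linkers are implemented; it only encodes the rules as stated in this section. The 
function name search_order() and the placeholder string "<standard>" (standing for the standard 
library directories) are hypothetical. 

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * For the -l switch at args[lib_index], write the directories that would be
 * searched, in order, into dirs[]. Entries point into args (nothing is
 * copied). The standard library directories are represented by the single
 * placeholder "<standard>". Returns the number of entries written.
 */
size_t search_order(const char *const args[], size_t lib_index,
                    const char *dirs[], size_t max_dirs)
{
    size_t n = 0, i;

    assert(strncmp(args[lib_index], "-l", 2) == 0);

    /* Only -L switches listed before the library apply. Walking backward
       puts the most recently named directory first (reverse order). */
    for (i = lib_index; i-- > 0; ) {
        if (strncmp(args[i], "-L", 2) == 0 && n < max_dirs)
            dirs[n++] = args[i] + 2;    /* skip the "-L" prefix */
    }

    /* The standard directories are always searched last. */
    if (n < max_dirs)
        dirs[n++] = "<standard>";
    return n;
}
```

Applied to the command line icc -nx myprog.c -lnews -L../mylibs -lgx -Llocallibs -llocal, this 
model gives only the standard directories for -lnews; ../mylibs and then the standard directories 
for -lgx; and locallibs, ../mylibs, and then the standard directories for -llocal. 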


Running Applications 

Once you have compiled your application into a Paragon OSF/1 executable file (and, if necessary, 
copied the executable to an Intel supercomputer), you run it by typing its name at your Paragon 
OSF/1 shell command prompt, as you would for any other compiled program. 

For example, if myapp is a compiled application, you can execute it with the following command: 

% myapp 

The way the application runs depends on how you linked it and on what system calls it makes: 

• If myapp was linked with the -nx switch, this command runs myapp on your default number of 
nodes in your default partition. The section “Controlling the Application’s Execution 
Characteristics” on page 2-13 tells you more about the default partition, and about the 
environment variables and command-line switches you can use to control the execution 
characteristics of applications linked with the -nx switch. 

• If myapp was linked with the -lnx switch, this command runs myapp on the nodes and partition 
specified by system calls within the application. The section “Managing Applications” on page 
4-2 tells you how to use these system calls. If myapp does not specify the nodes and partition in 
these calls, it defaults to running on your default number of nodes in your default partition. If 
myapp does not make any of these calls, it runs on one node in the service partition. 

• If myapp was linked without the -nx or -lnx switch, it is an ordinary non-parallel program, and 
it runs on one node in the service partition. 

If you see the error message “request overlaps with nodes in use,” it means that your default partition 
does not allow overlapping applications and someone else is already running an application in that 
partition. Try again later, or use a different partition (as described under “Running an Application in 
a Particular Partition” on page 2-22). You can use the pspart command to determine which 
partitions have applications running in them, as described under “Listing the Applications in a 
Partition” on page 2-51. 

If you see the error message “partition permission denied” or “exceeds partition resources,” check 
to be sure the environment variables NX_DFLT_PART and NX_DFLT_SIZE are properly defined. 
See “Using the Default Partition” on page 2-14 and “Specifying Application Size” on page 2-15 for 
more information on these variables; see your system administrator for information on the proper 
settings for these variables at your site. 

If you see the error message “error 216 occurred, unknown,” it means that the application was 
compiled on a previous release of the Paragon OSF/1 operating system and uses an out-of-date 
version of the libraries. (Error 216 is “parallel application incompatible with OS release”, but the 
“unknown” message may appear if the application is so out-of-date that it doesn’t know about the 
existence of this error.) If this occurs, recompile the application and try again. 


I/O Redirection 

You can redirect the standard input, standard output, and standard error of an application with the 
usual OSF/1 techniques. For example, the following command redirects the input and output of the 
application myapp: 

% myapp < myfile.in > myfile.out 

This command runs the application myapp with its standard input redirected from the file myfile.in 
and its standard output redirected to the file myfile.out. 

Note that, by default, all the nodes read and write their standard input, standard output, and standard 
error using PFS I/O mode 0. In mode 0, all file access requests are honored on a first-come, 
first-served basis. You can change this behavior by selecting a different I/O mode; see “Using I/O 
Modes” on page 5-13 for more information. The standard input, standard output, and standard error 
are line-buffered by default. This means that if all the nodes write to standard output or standard 
error, the output from all the nodes is intermixed in the output, line by line; if all the nodes read from 
standard input, each line of the input goes to an arbitrary node. 


Controlling the Application’s Execution Characteristics 


Command Synopsis                                  Description 

application [ -sz size | -sz hXw | -nd hXw:n ]    Execute a Paragon OSF/1 application. 
    [ -pri priority ] [ -pt ptype ] 
    [ -on nodespec ] [ -pn partition ] 
    [ mp_switches ] 
    [ \; app2 [ -pt ptype ] [ -on nodespec ] ] ... 


When you run an application, you can use command-line switches and environment variables to 
control the way the application executes. This section discusses all the switches and environment 
variables except for the mp_switches, which are used for message-passing performance tuning; for 
information on the mp_switches, see “Message-Passing Configuration Switches” on page 8-18. 

Command-line switches can appear in any order on the command line, and may be intermixed with 
application-specific switches and arguments. If you specify the same command-line switch more 
than once in a single command, the last occurrence overrides the earlier ones. For example, the 
following two commands are equivalent: 

% myapp -sz 4 -sz 50 -pri 8 file.dat 
% myapp -pri 8 -sz 4 file.dat -sz 50 

Each of these commands runs the application myapp, with the argument file.dat, at priority 8 on 50 
nodes of your default partition. 

If the application was linked with the -nx switch, the command-line switches discussed in this 
section are interpreted and removed from the command line before the application starts up. In the 
previous examples, the arguments -pri 8, -sz 4, and -sz 50 are interpreted and removed by the -nx 
code; myapp sees only the argument file.dat (if myapp is a C program, argc is 2, argv[0] is “myapp”, 
and argv[1] is “file.dat”). 
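
The two rules above (the last occurrence of a switch wins, and the switches are removed before the application sees its arguments) can be sketched in Python as follows. This is a simplified model written for illustration, not the -nx startup code: the name strip_nx_switches is hypothetical, each listed switch is assumed to take exactly one value, and the mp_switches are not modeled.

```python
# Switches from the command synopsis; each is assumed to take one value.
NX_SWITCHES = {"-sz", "-nd", "-pri", "-pt", "-on", "-pn"}


def strip_nx_switches(argv):
    """Return (settings, remaining_argv) for a command line given as a list."""
    settings = {}
    remaining = [argv[0]]          # the program name is always kept
    i = 1
    while i < len(argv):
        if argv[i] in NX_SWITCHES and i + 1 < len(argv):
            settings[argv[i]] = argv[i + 1]   # a later occurrence overwrites an earlier one
            i += 2
        else:
            remaining.append(argv[i])          # application-specific argument
            i += 1
    return settings, remaining


print(strip_nx_switches("myapp -sz 4 -sz 50 -pri 8 file.dat".split()))
print(strip_nx_switches("myapp -pri 8 -sz 4 file.dat -sz 50".split()))
```

Both calls yield the same settings (size 50, priority 8) and the same remaining argument list, which corresponds to argc being 2 in the C program.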


NOTE 

All the examples in this section assume that myapp was linked 
with the -nx switch. 

An application that is not linked with -nx controls its own execution with system calls, as discussed 
under “Managing Applications” on page 4-2. Such an application may or may not obey the 
command-line switches discussed in this section, depending on how it was programmed. 


Using the Default Partition 

When you run a parallel application on the Intel supercomputer, it runs in a partition. The partition 
determines the maximum number of nodes used by the application and how the application is 
scheduled, as described later in this chapter. An application stays in the same partition for its entire 
run. 

If you do not specify otherwise, the application runs in the partition specified by the environment 
variable NX_DFLT_PART. If the environment variable NX_DFLT_PART is not set, the application 
runs in the compute partition, a special partition that is present on all Intel supercomputers. The 
partition specified by NX_DFLT_PART (or, if this variable is not set, the compute partition) is called 
your default partition. 

For example, to run the application myapp in your default partition, use the following command: 

% myapp 

This command runs the application myapp in the partition specified by the environment variable 
NX_DFLT_PART, or in the compute partition if NX_DFLT_PART is not set. 

If you see an error message such as “partition not found” or “partition permission denied,” ask your 
system administrator what your default partition should be, then use the commands described in the 
next section to set the variable NX_DFLT_PART to that value. You can also use the -pn switch 
(described under “Running an Application in a Particular Partition” on page 2-22) to run an 
application in a different partition. 

For more information about partitions, see “Managing Partitions” on page 2-25. 


Setting Your Default Partition 

The command you use to set or change your default partition depends on which shell you use. 

• If you use the C shell, use the setenv command. For example, if you are a C shell user, the 
following command sets your default partition to mypart: 

% setenv NX_DFLT_PART mypart 

setenv is a built-in command of the shell; see csh in the OSF/1 Command Reference for more 
information. 

You can put this command in your .login or .cshrc file on the Intel supercomputer to have your 
default partition set to mypart each time you log in. 

• If you use the Bourne or Korn shell, set the variable and use the export command to make its 
value available to commands other than the shell. For example, if you are a Bourne or Korn shell 
user, the following commands set your default partition to mypart: 

$ NX_DFLT_PART=mypart 
$ export NX_DFLT_PART 

You do not have to use the export command each time you set the variable. You only have to 
export a variable once in each login session. export is a built-in command of the shell; see sh 
or ksh in the OSF/1 Command Reference for more information. 

You can put these commands in your .profile file on the Intel supercomputer to have your 
default partition set to mypart each time you log in. 

You can use an absolute or relative partition pathname as the value of NX_DFLT_PART. For 
example, the following C shell commands are equivalent: 

% setenv NX_DFLT_PART myorg.mypart 
% setenv NX_DFLT_PART .compute.myorg.mypart 

See “Partition Pathnames” on page 2-28 for more information on partition pathnames. 
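
As a rough illustration of the equivalence shown above, the following Python sketch converts a partition pathname to absolute form. It assumes that a relative pathname is interpreted relative to the compute partition, which is inferred only from the pair of equivalent commands in this section; the complete rules are in the section on partition pathnames, and the function name absolute_partition_path is hypothetical.

```python
def absolute_partition_path(name):
    """Return the absolute form of a partition pathname.

    Assumption: a name that does not begin with a period is relative
    to the compute partition.
    """
    if name.startswith("."):
        return name                 # already absolute
    return ".compute." + name


print(absolute_partition_path("myorg.mypart"))
print(absolute_partition_path(".compute.myorg.mypart"))
```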

If you use the C or Korn shell, you can create an alias to change your default partition. For example, 
the following C shell command creates a “setpart” alias that sets your default partition to its 
argument: 

% alias setpart 'setenv NX_DFLT_PART \!*' 


Determining the Current Default Partition 

To find out your default partition once you have set it, use the echo command. For example: 

% echo $NX_DFLT_PART 
mypart 

This command works the same in any shell. 


Specifying Application Size 

An application’s size is the number of nodes allocated to the application from the partition. The 
processes of the application run only on this set of nodes, and do not exchange messages with 
processes on nodes outside this set. Depending on the characteristics of the partition, this allocation 
may or may not be exclusive: some or all of these nodes may also be allocated to other applications 
and/or other partitions. An application keeps the same size for its entire run. 

To set an application’s size, use the switch -sz size, where size is any positive integer less than or 
equal to the number of nodes in the partition. For example, to run the application myapp on 64 nodes 
of your default partition, use the following command: 

% myapp -sz 64 

The -sz size switch attempts to allocate a square group of nodes if it can. If this is not possible, it 
attempts to allocate a rectangular group of nodes that is either twice as wide as it is high or twice as 
high as it is wide. If this is not possible, it allocates any available nodes; in this case, nodes allocated 
to the application may not be contiguous (that is, they may not all be physically next to each other). 
If the requested number of nodes is not available, the command fails and the application does not 
run; an error message is printed to explain why the specified number of nodes is not available. 

No matter what the shape of the application, node numbers within the application (as returned by 
mynode()) will always be sequential from 0. 
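
The shape preference described above can be sketched in Python. This is only a model of the stated preference order (a square first, then a 2-to-1 rectangle), not the allocator itself: it does not check whether nodes are free, the name preferred_shapes is hypothetical, and the order in which the two 2-to-1 orientations are tried is an assumption, since this section does not specify it.

```python
import math


def preferred_shapes(size):
    """Return candidate (height, width) shapes for -sz size, most preferred first.

    An empty list means no square or 2-to-1 rectangle exists for this size,
    so any available nodes would be used.
    """
    shapes = []
    root = math.isqrt(size)
    if root * root == size:
        shapes.append((root, root))                    # square
    half_root = math.isqrt(size // 2)
    if 2 * half_root * half_root == size:
        shapes.append((half_root, 2 * half_root))      # twice as wide as it is high
        shapes.append((2 * half_root, half_root))      # twice as high as it is wide
    return shapes


for size in (64, 32, 20):
    print(size, preferred_shapes(size))
```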


Specifying a Rectangle of Nodes 

To force allocation of a contiguous rectangle of a particular size and shape, use the switch -sz hXw, 
where h and w are positive integers that specify the height and width of the desired rectangle. (You 
can use an uppercase or lowercase letter X between the integers h and w.) For example, to run myapp 
on an 8 by 8 node rectangle of your default partition, use the following command: 

% myapp -sz 8x8 

If successful, this command runs myapp on an 8 by 8 rectangle of nodes, which could be 
located anywhere within the partition that it fits. If no 8 by 8 node rectangle is available in the default 
partition, the command fails immediately and the application does not run, even if there are 64 nodes 
free in the partition. In that case, the error message is “exceeds partition resources” if no such 
rectangle fits within the partition, or “request overlaps with nodes in use” if the rectangle fits 
within the partition but some of its nodes are busy. 


Specifying a Particular Rectangle of Nodes 

To force allocation of a contiguous rectangle of a particular size and shape at a particular location 
within the partition, use the switch -nd hXw:n. (This switch is called -nd, rather than -sz, because it 
specifies a particular set of nodes rather than just a size or shape.) 

In the -nd hXw:n switch, h and w are positive integers that specify the height and width of the 
desired rectangle, and n is a non-negative integer that specifies the node number within the partition for 
the upper left corner of that rectangle. You can use an uppercase or lowercase letter X between the 
integers h and w. When choosing the value of n, remember that in an m-node partition the nodes are 
numbered left to right and top to bottom from 0 to m-1. 

For example, to run myapp on an 8 by 8 node rectangle in the upper left corner of your default 
partition, use the following command: 

% myapp -nd 8x8:0 

In this case, if the specified nodes are not available in the default partition, the application fails 
immediately (even if there is a different 8 by 8 node rectangle available). 
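
To see which partition node numbers a given -nd value covers, you can work through the left-to-right, top-to-bottom numbering by hand or with a short Python sketch such as the following. The sketch assumes a rectangular partition whose width you already know (the widths used here are made-up values), checks only the right edge of the partition, and uses hypothetical function names.

```python
def parse_nd(spec):
    """Split an 'hXw:n' value into (h, w, n); the X may be upper or lower case."""
    shape, _, start = spec.partition(":")
    h, w = shape.lower().split("x")
    return int(h), int(w), int(start)


def rectangle_nodes(spec, partition_width):
    """List the partition node numbers in the rectangle, row by row."""
    h, w, n = parse_nd(spec)
    if n % partition_width + w > partition_width:
        raise ValueError("rectangle extends past the right edge of the partition")
    return [n + row * partition_width + col
            for row in range(h)
            for col in range(w)]


# A 2 by 3 rectangle whose upper left corner is node 1 of a 4-node-wide partition.
print(rectangle_nodes("2x3:1", 4))
```

In a partition that is 4 nodes wide, node 1 is in the top row, so the rectangle covers nodes 1 through 3 of the first row and nodes 5 through 7 of the second row.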


Using the Default Size 

If you don’t use the -sz or -nd switch, the application’s size is specified by the environment variable 
NX_DFLT_SIZE, whose value must be a single positive integer. You can use the techniques 
discussed for the NX_DFLT_PART variable in the previous section to get and set the value of the 
NX_DFLT_SIZE variable. If NX_DFLT_SIZE is not set, the application runs on all nodes of the 
partition, and its size is set to the size of the partition. The size specified by NX_DFLT_SIZE (or, if 
this variable is not set, the size of the partition) is called your default number of nodes. 
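
The precedence among the command-line switches, the environment variables, and the built-in defaults can be summarized in a Python sketch. This is a simplified model for illustration only: the function names are hypothetical, the environment is passed in as a dictionary, the compute partition is written as .compute, and error cases (such as a size larger than the partition) are not handled.

```python
def effective_partition(env, pn=None):
    """Partition the application runs in: -pn, then NX_DFLT_PART, then compute."""
    if pn is not None:
        return pn
    return env.get("NX_DFLT_PART", ".compute")


def effective_size(env, partition_size, sz=None):
    """Number of nodes: -sz, then NX_DFLT_SIZE, then the whole partition."""
    if sz is not None:
        return sz
    if "NX_DFLT_SIZE" in env:
        return int(env["NX_DFLT_SIZE"])
    return partition_size


env = {"NX_DFLT_PART": "mypart", "NX_DFLT_SIZE": "64"}
print(effective_partition(env), effective_size(env, partition_size=128))
print(effective_partition({}), effective_size({}, partition_size=128))
```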

An application can determine its size by calling numnodes(), and each process in the application can 
determine its node number within the application by calling mynode(). mynode() returns a node 
number from 0 to one less than the application’s size. (See “Process Characteristics” on page 3-3 for 
more information on these calls.) For example, with -sz 64, -sz 8x8, or -nd 8x8:0, numnodes() 
returns 64 and mynode() returns a number from 0 to 63 inclusive. There is no way for an application 
to change its size. 

An application can determine its shape by calling nx_app_rect(), which returns the height and width 
of the rectangle of nodes allocated to the application. If the nodes allocated to the application do not 
form a rectangle, nx_app_rect() returns a height of 1 and a width equal to numnodes(). 
(nx_app_rect() can also be called by the name mypart() for compatibility with the Touchstone 
DELTA System.) 


Specifying Application Priority 

An application’s priority is an integer associated with the application that is used in determining how 
much of a node’s processor time the application gets when the node is allocated to more than one 
application at once. 0 is the lowest priority, and 10 is the highest. 

The application’s priority is only one of several factors that determine how much processor time it 
gets. For example, the application’s processor time can be affected by the priorities of other 
applications in the system and by the effective priority limit of the partition in which the application 
runs. See “Scheduling Characteristics” on page 2-33 for more information. 

To set the priority of the application, use the switch -pri priority, where priority is an integer from 
0 to 10 inclusive. If you don’t use the -pri switch, the application’s priority is set to 5. 

For example, to run the application myapp with a priority of 6, use the following command: 

% myapp -pri 6 

An application can change its priority by calling nx_pri() (see “Setting an Application’s Priority 
with nx_pri()” on page 4-9 for more information). 


Specifying Process Type 

A process’s process type, or ptype, is an integer associated with the process that differentiates it from 
any other process in the application that is on the same node. The process’s node number and process 
type together form the process’s “address” for messages within the application. 

To set the process type of each process in the application, use the switch -pt ptype, where ptype is 
an integer from 0 to 2,147,483,647 (2^31 - 1) inclusive. If you don’t use the -pt switch, the process 
type of each process is 0. 

For example, to run the application myapp with a process type of 1 for each process, use the 
following command: 

% myapp -pt 1 

A process can find out its current process type by calling myptype(). For example, with -pt 1, 
myptype() returns 1 on all nodes. Once a process’s process type has been set to a valid value, it 
cannot change its process type and no other process in the same application on the same node can 
use that process type for the run of the application. See “Process Characteristics” on page 3-3 for 
information on process types and the myptype() and setptype() system calls. 

The -pt switch is most commonly used when running multiple programs in one application, as 
discussed under “Running Applications Consisting of Multiple Programs” on page 2-21. In most 
other circumstances, you can use the default process type of 0. 


Running a Program on a Subset of the Nodes 

Usually you run the same program file on all the nodes allocated to the application from the partition. 
However, you can also run a program on just some of the nodes, leaving the other nodes vacant for 
other programs. When you do this, the other nodes are allocated to the application, but no processes 
are started on them. 

To run a program on a subset of the nodes of an application, use the switch -on nodespec, where 
nodespec is one of the following: 

x                   The node whose node number is x. 

x..y                The range of nodes from numbers x to y. 

n                   The last node of the application. 

nspec[,nspec]...    The specified list of nodes, where each nspec is a node specifier of the form 
                    x, x..y, or n (no node may appear more than once in this list). Do not put any 
                    spaces in this list. 

If you don’t use the -on switch, the program is run on all nodes allocated to the application. 


NOTE 

The numbers you use with -on are node numbers within the 
application (which always range from 0 to one less than the size 
of the application), not node numbers within the partition. 


For example, to run the program myapp on the first three nodes of a 20-node application, use the 
following command: 

% myapp -sz 20 -on 0,1,2 

This command creates an application of size 20 in your default partition and runs myapp on nodes 
0, 1, and 2 of the application. Within this application, the function numnodes() returns 20, and the 
function mynode() returns a number from 0 to 19 inclusive. However, no processes are started on 
nodes 3 through 19. 

You can use the letter n to represent “the last node in the application.” For example, the following 
command creates an application of your default size in your default partition and runs myapp on the 
first and last nodes of the application: 

% myapp -on 0,n 

For example, if your NX_DFLT_SIZE variable is set to 64 (and there are at least 64 nodes in your 
default partition), this would run myapp on nodes 0 and 63 of the application. 

You can also use a pair of numbers separated by two periods (x..y) to specify “nodes x through y 
inclusive.” For example, the following command creates an application of size 100 in your default 
partition and runs the program myapp on nodes 10 through 90: 

% myapp -sz 100 -on 10..90 

It doesn’t matter whether y is greater than x or vice versa. For example, the following command also 
creates an application of size 100 in your default partition and runs the program myapp on nodes 10 
through 90: 

% myapp -sz 100 -on 90..10 

These notations can be combined. For example, the following command creates an application of 
your default size in your default partition and runs myapp on all nodes but node 0 of the application: 

% myapp -on 1..n 

Another example: the following command creates an application of your default size in your default 
partition and runs myapp on node 1, node 3, nodes 5 through 10 inclusive, and the last node of the 
application: 

% myapp -on 1,3,5..10,n 
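
The nodespec notations described above can be expanded into an explicit list of node numbers with a short Python sketch. This is an illustrative model of the syntax, not the code used by the system: the name expand_nodespec is hypothetical, and raising an error for a duplicate or out-of-range node is an assumption about how such a specification would be rejected.

```python
def expand_nodespec(nodespec, app_size):
    """Expand an -on nodespec into a list of node numbers within the application."""
    last = app_size - 1

    def node(token):
        # The letter n stands for the last node in the application.
        return last if token == "n" else int(token)

    nodes = []
    for item in nodespec.split(","):
        if ".." in item:
            # The order of the two endpoints does not matter.
            low, high = sorted(node(token) for token in item.split(".."))
            expanded = range(low, high + 1)
        else:
            expanded = [node(item)]
        for number in expanded:
            if not 0 <= number <= last:
                raise ValueError(f"node {number} is outside the application")
            if number in nodes:
                raise ValueError(f"node {number} appears more than once")
            nodes.append(number)
    return nodes


print(expand_nodespec("1,3,5..10,n", 64))
```

With an application size of 64, the letter n expands to node 63, so the list ends with that node.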


NOTE 

Do not use -on if you just want to run a single program on a 
specific number of nodes. 


The -on switch is designed to be used when running multiple programs as a single application, as 
discussed in the next section. You can also use the -on switch to run a “manager” program on one 
or a few nodes of an application; the “manager” program can then run “worker” programs on other 
nodes by calling nx_nfork(), nx_load(), or nx_loadve() (see “Managing Applications” on page 4-2 
for information on these functions). 

The -on switch is not designed to run an application on a particular number of nodes or a particular 
set of nodes. If you want to run an application on a particular number of nodes, use the -sz switch. 
If you want to run an application on a particular set of nodes, allocate a partition containing those 
nodes and run the application on all nodes of that partition (see “Managing Partitions” on page 2-25 
for information on partitions). 

If you use -on when you should be using -sz, the application will be allocated more nodes than it 
needs. Also, if you use -on and do not run a program on every node of the application, global 
operations will hang. (The global operations described under “Global Operations” on page 3-27, 
such as gdsum(), block until they are called by every node in the application. If you run a program 
on only a subset of the nodes, these operations will block forever.) 


Running Applications Consisting of Multiple Programs 

You can run multiple program files as a single application. For example, you could run two or more 
separate programs on every node (the resulting processes must have different process types, and the 
processes time-share the processor while the application is active). You might also run a manager 
program on one node and worker programs on the other nodes. The programs should be written to 
work together; you would not usually run two arbitrary programs together in one application. 

To run multiple program files as a single application, use the following syntax: 

% file [ switches ] [ \; file [ -pt ptype ] [ -on nodespec ] ] ... 

That is, you use two or more complete commands on one line, separated by an escaped semicolon 
(backslash followed by semicolon). 


NOTE 

The escaped semicolon (\;) must be preceded and followed by a 
space or tab. Otherwise, it will be considered part of the preceding 
or following argument. 
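
The effect of the spaces around the escaped semicolon can be demonstrated with Python's shlex module, which splits a command line using POSIX shell quoting rules. This is only an approximation made for illustration: shlex does not emulate the C shell exactly, and the helper split_programs is a hypothetical model of how a command line could be divided into per-program argument lists.

```python
import shlex


def split_programs(argv):
    """Split an argument list into one list per program at each lone ';' argument."""
    commands = [[]]
    for arg in argv:
        if arg == ";":
            commands.append([])        # start the next program's arguments
        else:
            commands[-1].append(arg)
    return commands


# With spaces on both sides, the escaped semicolon becomes its own argument.
print(split_programs(shlex.split(r"manager -sz 20 -on 0 \; worker -on 1..n")))

# Without the leading space, the semicolon sticks to the preceding argument.
print(split_programs(shlex.split(r"myapp\; myapp2 -pt 1")))
```

In the second call the semicolon becomes part of the argument myapp; and the command line is not split into two programs.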


The first file must either have been linked with -nx or must call nx_initve() without overriding the 
command line; the second and subsequent files may have been linked with or without -nx, but must 
not call nx_initve(). 

The command-line switches you can use with the files are different: 

• You can use any application switches (-sz, -pri, -pt, -on, -pn, and mp_switches) with the first 
file. The effect of these switches varies according to the switch: 

The -sz, -pri, -pn, and mp_switches you use with the first file affect the entire 
application. 

The -pt and -on switches you use with the first file affect the first file only. 

• You can use only the -pt and -on switches with the second and subsequent files. These switches 
affect the associated file only. 

If you run multiple processes on a single node, you must use the -pt switch to specify a unique 
process type for each process. When two or more processes in an application run on the same node, 
each must have a different process type. If you don’t use the -pt switch, each process will have 
process type 0, and you will receive an error message. 

For example, to run the programs myapp and myapp2 as a single application, use the following 
command: 

% myapp \; myapp2 -pt 1 

This command runs the program myapp with process type 0 and the program myapp2 with process 
type 1 on your default number of nodes in your default partition. 

To run the program manager on node 0 of a 20-node application and the program worker on the 
remaining nodes, use the following command: 

% manager -sz 20 -on 0 \; worker -on 1..n 

This command creates an application of size 20 in your default partition. It then runs the program 
manager on node 0 of the application and the program worker on nodes 1 through 19 of the 
application. All the resulting processes have process type 0, but this does not create a conflict 
because manager and worker run on different nodes. 
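
The rule that no two processes on the same node may share a process type can be checked ahead of time with a Python sketch like the following. It is an illustrative model only: the name ptype_conflicts and the (program, ptype, nodes) tuples are assumptions made for the example, and the system's own check and error message are not reproduced here.

```python
from collections import Counter


def ptype_conflicts(placements):
    """Return the sorted (node, ptype) addresses claimed by more than one process.

    placements is an iterable of (program, ptype, nodes) tuples.
    """
    counts = Counter(
        (node, ptype)
        for _program, ptype, nodes in placements
        for node in nodes
    )
    return sorted(address for address, count in counts.items() if count > 1)


# manager on node 0 and worker on nodes 1..19, all with process type 0: no conflict.
print(ptype_conflicts([("manager", 0, [0]), ("worker", 0, range(1, 20))]))

# Two programs on the same nodes with the same process type: every node conflicts.
print(ptype_conflicts([("myapp", 0, range(2)), ("myapp2", 0, range(2))]))
```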

NOTE 

If you forget the backslash before the semicolon, the first program 
is run as an application by itself and the second program runs after 
the first program finishes. This usually results in unexpected 
behavior from the programs. 


Running an Application in a Particular Partition 

To run an application in a partition other than your default partition, use the switch -pn partition. 
You must have execute permission for the specified partition. The partition specified by -pn 
overrides the value of NX_DFLT_PART, if any. If you don’t use the -pn switch, the application runs 
in your default partition, as described under “Using the Default Partition” on page 2-14. 


NOTE 

If your default number of nodes, as specified by the environment 
variable NX_DFLT_SIZE, is greater than the number of nodes 
available in the specified partition, you may get a “partition 
resources exceeded” or “request overlaps with nodes in use” 
error. 


If you see this error, use the -sz switch or change the value of NX_DFLT_SIZE to specify an 
application size less than or equal to the size of the specified partition. 

For example, to run the application myapp on your default number of nodes in the partition mypart, 
use the following command: 

% myapp -pn mypart 

You can use an absolute or relative partition pathname with -pn (see “Partition Pathnames” on page 
2-28 for information on partition pathnames). For example, the following commands are equivalent: 

% myapp -pn myorg.mypart 
% myapp -pn .compute.myorg.mypart 

For more information about partitions, see “Managing Partitions” on page 2-25. 

Managing Running Applications 

You use the standard OSF/1 techniques to manage running applications. For example, you use your 
interrupt key (usually <Del> or <Ctrl-c>) to interrupt a running application. If you use the C 
shell or Korn shell, you can use your suspend key (usually <Ctrl-z>) to suspend an application, 
and the fg or bg command to resume it. See csh, sh, or ksh in the OSF/1 Command Reference for 
more information on these techniques. 


NOTE 

Interrupting or suspending an application that is “rolled out” will not 
take effect until the application is “rolled in” again. 


Parallel applications can be gang-scheduled to make more efficient use of system resources. In gang 
scheduling, an application is allowed to run for a time period, called the rollin quantum, and then is 
“rolled out” and another application is “rolled in” in its place. If the rollin quantum is long, you may 
not see any response to a <Ctrl-c> or <Ctrl-z> for a long time. See “Scheduling 
Characteristics” on page 2-33 for more information on gang scheduling. 

You can also use the ps command to determine the status of an application, and the kill command 
to terminate it. For example: 

% myapp & 
[1] 7045 
% ps 
   PID  TT  STAT     TIME  COMMAND 
  5841  p3  S +   0:02.50  -csh (csh) 
  7045  p3  R     0:00.30  myapp 
% kill 7045 
% ps 
   PID  TT  STAT     TIME  COMMAND 
  5841  p3  S +   0:02.55  -csh (csh) 
[1]  + Terminated    myapp 
% 

The ps command shows only processes running in the service partition. See ps and kill in the OSF/1 
Command Reference for more information on these commands. To show processes running in 
partitions other than the service partition, use the pspart command. 

The myapp process that you see in the output of ps is a special process called the controlling process 
that runs in the service partition; you do not see the other application processes in the output of ps. 
However, sending a signal to the controlling process with <Del>, <Ctrl-c>, <Ctrl-z>, or kill 
signals all the processes in the application. See “Managing Applications” on page 4-2 for more 
information on the controlling process. 

If the application was started from the Bourne shell (sh) or from a shell script, you will see two 
processes with the name of the application in the output of ps. One of these two processes is the 
controlling process; the other is another special process, called the shepherd process. The shepherd 
process is necessary for the application; do not kill it. When the application terminates, this process 
will terminate as well. 

To determine which process is which, use the command ps -f and examine the PPID (parent PID) 
fields of the two processes. The shepherd process is the parent of the controlling process. For 
example: 


$ ps -f 
USER      PID    PPID  %CPU   STARTED  TT     TIME  COMMAND 
chris  131125  131124   0.0  13:55:51  p2  0:00.28  -sh (sh) 
chris  131129  131125   0.0  13:56:36  p2  0:00.05  myapp 
chris  131130  131129   0.0  13:56:36  p2  0:00.03  myapp 


In this case the second myapp process (PID 131130) is the controlling process. The first myapp 
process, PID 131129, is the parent of the controlling process and is therefore the shepherd process. 
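
The PPID comparison described above can also be done mechanically. The following Python sketch takes rows of (PID, PPID, command) values, such as those you might copy from the output of ps -f, and reports which process is the shepherd and which is the controlling process. The function name classify_app_processes and the row format are assumptions made for this example.

```python
def classify_app_processes(rows, app_name):
    """Return (shepherd_pid, controlling_pid), or None if no such pair is found.

    rows is an iterable of (pid, ppid, command) tuples.
    """
    # Keep only the processes that carry the application's name.
    app = {pid: ppid for pid, ppid, command in rows if command == app_name}
    for pid, ppid in app.items():
        if ppid in app:
            # The parent is the shepherd; its child is the controlling process.
            return ppid, pid
    return None


rows = [
    (131125, 131124, "-sh (sh)"),
    (131129, 131125, "myapp"),
    (131130, 131129, "myapp"),
]
print(classify_app_processes(rows, "myapp"))
```

If only one process with the application's name is present (for example, when the application was started from the C shell), the function returns None because there is no shepherd process to identify.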

You can use the pspart command to determine the status of all the applications in a particular 
partition. See “Listing the Applications in a Partition” on page 2-51 for information on this 
command. 

You can also use the Interactive Parallel Debugger (ipd) to control the execution of an application, 
down to the machine instruction. See the Paragon Interactive Parallel Debugger Reference 
Manual for information on ipd. 


Managing Partitions 

The nodes of the Intel supercomputer are divided into overlapping groups called partitions. When 
you run a parallel application, you must select a partition to run it in. The partition places limits on 
the execution characteristics of the application, such as which nodes it can use, whether or not it can 
use nodes that are already in use, and how long it can use them before it is “rolled out” and another 
application is “rolled in.” 

Depending on the policies of your site, you may or may not have to know any more about partitions 
than what has been discussed in this chapter so far. 

• At some sites, the system administrator configures all the partitions; ordinary users can simply 
set the NX_DFLT_PART variable to an appropriate value (or leave it unset and use the compute 
partition) and then forget all about partitions. If your site is like this, you do not have to read this 
section. However, you may wish to read it to help you understand how the system works. 

• At other sites, users create and configure their own partitions. If your site is like this, you should 
read this section. 

This section includes the following information about partitions: 

• Some special partitions that every Intel supercomputer has. 

• Specifying partitions with partition pathnames. 

• The characteristics of a partition. 

• Making partitions with the mkpart command. 

• Removing partitions with the rmpart command. 

• Showing the characteristics of a partition with the showpart command. 

• Listing the subpartitions of a partition with the lspart command. 

• Listing the applications in a partition with the pspart command. 

• Changing the characteristics of a partition with the chpart command. 


Special Partitions 

Every Intel supercomputer has three special partitions: 

• The root partition directly or indirectly contains all the other partitions in the system. It is the 
only partition that does not have a parent partition. 

• The service partition is the partition in which the users’ shells and other commands run. Its 
parent is the root partition. 

• The compute partition is the partition in which parallel applications run. Its parent is also the 
root partition. 

The characteristics of these partitions are determined by the system administrator. In particular, the 
system administrator sets the ownership and permissions of these partitions according to local 
policies. These ownerships and permissions determine whether or not ordinary users can create 
partitions for their own use, or whether they must run applications in partitions provided for them by 
the system administrator. If ordinary users are allowed to create partitions, the system administrator 
can also place restrictions on the characteristics of partitions they create and the use of certain 
application switches within partitions. 

Typically, the service partition and compute partition are the only two children of the root partition 
and do not overlap. However, the system administrator can choose to configure these partitions 
differently, and may also create additional child partitions of the root partition. 

For example, some systems have an I/O partition: a third child of the root partition, which does not 
overlap with either the service or compute partitions, and which contains the nodes that control disks 
and other I/O devices. In other systems, the I/O “partition” is not a true partition, but a set of nodes 
in the root partition that are not part of either the service or the compute partition. 


The Root Partition 

The root partition is the basis for all other partitions. The name of the root partition is . (dot). 

The root partition contains every usable node in the system. Depending on the underlying hardware, 
there may be unusable nodes within the root partition as well. The root partition organizes all the 
nodes in the system into a two-dimensional grid, or mesh. For example, Figure 2-1 shows the root 
partition of a 32-node system that is configured as a 4 by 8 node mesh. The nodes are numbered from 
0 to 31. 


NOTE 

The root partition is always rectangular. (This is not true of 
partitions other than the root partition.) 

Figure 2-1. The Root Partition of a 32-Node System 


For example, a system with 31 nodes would also be a 4-by-8-node rectangle, numbered as shown in 
Figure 2-1, but one of the nodes would be an unusable node, as described under “Unusable Nodes” 
on page 2-31. You would not be able to start any processes or allocate any subpartitions using this 
node. 


The Service Partition 

The service partition is the partition in which the users’ shells, OSF/1 commands, and other 
non-parallel programs run. The name of the service partition is service. The service partition may 
not contain any subpartitions. 

When you log into the Intel supercomputer, a shell is started for you on a node in the service 
partition; when you execute a command in this shell, the command runs on a node in the service 
partition. Note that the node the command runs on is not necessarily the same node that the shell runs 
on; the system starts each new process on the node that is currently the least busy. 


The Compute Partition 

The compute partition is the partition in which parallel applications run. The name of the compute 
partition is compute. 

When you execute a parallel application, one process (called the controlling process) runs in the 
service partition; the other processes of the application run in the compute partition, or in a 
subpartition of the compute partition. You can specify which partition an application runs in when 
you execute it. 

Your system administrator determines whether or not you can create subpartitions in the compute 
partition and whether or not you can execute applications in the compute partition itself. There may 
also be other local policies that affect how you use the compute partition; for example, you may be 
required to run your applications in certain subpartitions during the day and others at night. 


Partition Pathnames 

Since partitions have a hierarchical structure like directories, they also have pathnames like 
directories. Like a file or directory pathname, a partition pathname identifies a partition within the 
hierarchical partition structure by describing the path from a known location to the specified 
partition. 

Unlike file and directory pathnames, however, partition pathnames use a dot (.) instead of a slash 
(/) to separate the elements of the pathname. This is why the name of the root partition is . (dot). 
There is also no special partition pathname for “current partition” or “parent of the current partition.” 
Also, you cannot use wildcards (* and ?) in partition pathnames. 

There are two types of partition pathnames: 

• An absolute partition pathname specifies the path from the root partition to the specified 
partition. An absolute partition pathname begins with a dot (.). 

• A relative partition pathname specifies the path from the compute partition to the specified 
partition. A relative partition pathname does not begin with a dot. 


NOTE 

Relative partition pathnames are always relative to the compute 
partition (there is no “current partition”). 


The absolute partition pathnames of the root partition, service partition, and compute partition are 
. (dot), .service, and .compute respectively. Because these partitions are not subpartitions of the 
compute partition, they do not have relative partition pathnames. 

If the partition mypart is a subpartition of the compute partition, its absolute partition pathname is 
.compute.mypart and its relative partition pathname is just mypart. 

If subpart is a subpartition of mypart, its absolute partition pathname is .compute.mypart.subpart 
and its relative partition pathname is mypart.subpart. 
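
The pathname rules above can be sketched in a few lines of Python. This is only an illustration and is not part of the system software; the function names to_absolute and to_relative are hypothetical.

```python
COMPUTE_PREFIX = ".compute."


def to_absolute(pathname: str) -> str:
    """Return the absolute form of a partition pathname."""
    if pathname.startswith("."):
        return pathname                 # already absolute
    # Relative pathnames are always relative to the compute partition.
    return COMPUTE_PREFIX + pathname


def to_relative(pathname: str) -> str:
    """Return the relative form, or raise ValueError if there is none."""
    if not pathname.startswith("."):
        return pathname                 # already relative
    if not pathname.startswith(COMPUTE_PREFIX):
        # The root, service, and compute partitions (and anything outside
        # the compute partition) have no relative partition pathname.
        raise ValueError(f"{pathname!r} has no relative partition pathname")
    return pathname[len(COMPUTE_PREFIX):]
```

For example, to_absolute("mypart.subpart") gives ".compute.mypart.subpart", while to_relative(".service") raises an error because the service partition is not a subpartition of the compute partition.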


Partition Characteristics 

Each partition has the following characteristics: 

• A parent partition that contains it. 

• A name that identifies it. 

• A set of nodes that is allocated to it. 

• An owner and group and a set of protection modes, like those of a file or directory, that 
determine what actions a given user is allowed to perform on it. 

• A set of scheduling characteristics that determine how applications are scheduled in it. 

A partition’s characteristics are set when the partition is created. The mkpart command, described 
under “Making Partitions” on page 2-39, lets you specify most of these characteristics on the 
command line; if you don’t specify otherwise, the characteristics of a new partition are set to the 
same values as those of its parent partition. 

You can use the showpart command, described under “Showing Partition Characteristics” on page 
2-46, to determine a partition’s current characteristics. 

A partition’s parent partition and nodes cannot be changed. You can change the other characteristics 
with the chpart command, described under “Changing Partition Characteristics” on page 2-54. 


Parent Partition 

Each partition is contained within another partition. The containing partition is called the parent 
partition, and the contained partition is called a child partition or subpartition of the parent partition. 
(There is one exception to this rule: the root partition has no parent.) 

You specify a partition’s parent when you create it with mkpart. The parent partition determines the 
set of nodes that are available to be allocated to the new partition (a partition cannot include any 
nodes other than the nodes of its parent). The parent partition also determines the default 
characteristics of the new partition, as mentioned earlier. A partition’s parent does not change for the 
life of the partition. 


Partition Name 

Each partition is identified by a name. A partition’s name must be unique among all the partitions 
with the same parent. Partition names can be any length, but must consist of only uppercase letters 
(A-Z), lowercase letters (a-z), digits (0-9), and underscores (_). 
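
As an illustration, the naming rule can be checked with a regular expression. The function name is hypothetical, and rejecting an empty name is an assumption, since the text does not say whether a name may be empty. Uniqueness among partitions with the same parent is not checked here.

```python
import re

# Uppercase letters, lowercase letters, digits, and underscores only.
_PARTITION_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_partition_name(name: str) -> bool:
    """Return True if name uses only the characters allowed in partition names."""
    return _PARTITION_NAME.fullmatch(name) is not None
```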

You specify a partition’s name when you create it with mkpart, and you can use chpart to change 
an existing partition’s name (you must have write permission on the partition’s parent partition). 


Nodes Allocated to the Partition 

Each partition has a set of nodes allocated to it from its parent partition. Depending on the 
characteristics of the parent partition, this allocation may or may not be exclusive: some or all of 
these nodes may also be allocated to other partitions and/or applications. The number of nodes in 
this set is called the partition’s size. 

You can specify the set of nodes allocated to the partition when you create it with mkpart. You can 
specify the partition’s size and let the operating system select the nodes, or you can specify certain 
node numbers from the parent partition. If you don’t specify either, the new partition consists of all 
the nodes of the parent partition. 

The set of nodes allocated to a partition does not change for the life of the partition (that is, partitions 
never move or change their size or shape). Depending on how you allocate the nodes, they may or 
may not be contiguous (all adjacent to each other). Figure 2-2 shows examples of contiguous and 
noncontiguous partitions. 


Node Numbers Within a Partition 

Each node in a partition has a node number within the partition: an integer from 0 to one less than 
the partition’s size. The nodes in a partition are typically numbered from left to right and then from 
top to bottom, as shown in Figure 2-2. 


NOTE 

Because partitions can overlap, a single physical node can have 
many logical node numbers. 


For example, Figure 2-3 shows two partitions, called Partition A and Partition B, that have the same 
parent partition. Partition A consists of nodes 1 through 4 of the parent partition, and Partition B 
consists of nodes 4 through 8 of the parent partition. In this case, node 4 of the parent partition is also 
known as node 3 of Partition A and node 0 of Partition B. 
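
The following Python sketch reproduces this example. It assumes that the nodes of a partition are numbered in ascending order of their node numbers in the parent partition, which matches the example but may not hold for every partition; the function name is hypothetical.

```python
def logical_numbers(parent_node, partitions):
    """Map each partition that contains parent_node to the node's number there.

    partitions maps a partition name to the parent-partition node numbers
    allocated to it. Assumption: nodes are renumbered in ascending order.
    """
    return {
        name: sorted(nodes).index(parent_node)
        for name, nodes in partitions.items()
        if parent_node in nodes
    }


partitions = {
    "A": [1, 2, 3, 4],       # nodes 1 through 4 of the parent partition
    "B": [4, 5, 6, 7, 8],    # nodes 4 through 8 of the parent partition
}
print(logical_numbers(4, partitions))   # {'A': 3, 'B': 0}
```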

Figure 2-2. Node Numbers in Contiguous and Noncontiguous Partitions 

Unusable Nodes 

Occasionally a node may become unusable because of a hardware or software failure. If this occurs, 
the node is still allocated to any partitions to which it was allocated before it became unusable, but 
no applications can be run on that node and no new partitions can include that node until the node 
becomes usable again. The showpart and lspart commands indicate if there are any unusable nodes 
in a partition. 

For example, suppose you make a partition containing 20 nodes and later one of those nodes 
becomes unusable. If you attempt to run an application or make a subpartition with all 20 nodes of 
this partition while the node is unusable, the attempt will fail. 

Figure 2-3. Node Numbers in Overlapping Partitions 

Owner, Group, and Protection Modes 

Each partition has an owner, a group, and a set of protection modes, like those of a file or directory, 
that determine who can perform what operations on the partition. 


When you create a partition with mkpart, you become the new partition’s owner, and the new partition’s 
group is set to your current group (see newgrp in the OSF/1 Command Reference for more 
information on groups). If you are the owner of a partition, you can use chpart to change an existing 
partition’s group; only the system administrator can change an existing partition’s ownership. 

A partition’s protection modes consist of three groups of three permission bits (read, write, and 
execute for owner; read, write, and execute for group; and read, write, and execute for “other”), as 
described for the chmod command in the OSF/1 Command Reference. The read, write, and execute 
permission bits have the following meanings for a partition: 

r (read)        Allows listing the subpartitions and characteristics of the partition. 

w (write)       Allows creating and removing subpartitions in the partition and changing the 
                partition’s characteristics. 

x (execute)     Allows executing applications in the partition. 

The system administrator (root) is not affected by these permission bits; root can do anything to any 
partition at any time. 

The permission bits can be expressed as a three-digit octal number (as for the chmod command) or 
as a string of the form rwxrwxrwx (as used by the ls -l command, where a letter represents a bit 
that is “on” and a dash (-) represents a bit that is “off”). For example, the octal number 754 is 
equivalent to the string rwxr-xr--; both grant all permissions to the owner, read and execute 
permissions to the group, and read permission only to all other users. 
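
The equivalence between the two notations can be checked with a short Python sketch. The function names are hypothetical; the bit layout is the usual one used by chmod.

```python
def mode_to_string(mode: int) -> str:
    """Convert a 9-bit protection mode (for example 0o754) to rwxrwxrwx form."""
    letters = "rwxrwxrwx"
    # Bit 8 is owner read, bit 0 is "other" execute.
    return "".join(
        letter if mode & (1 << (8 - i)) else "-"
        for i, letter in enumerate(letters)
    )


def string_to_mode(text: str) -> int:
    """Convert a string such as 'rwxr-xr--' back to its numeric mode."""
    if len(text) != 9:
        raise ValueError("expected exactly nine characters")
    mode = 0
    for expected, actual in zip("rwxrwxrwx", text):
        if actual not in (expected, "-"):
            raise ValueError(f"unexpected character {actual!r}")
        mode = (mode << 1) | (actual != "-")
    return mode


print(mode_to_string(0o754))               # rwxr-xr--
print(oct(string_to_mode("rwxr-xr--")))    # 0o754
```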

When you create a partition with mkpart, you can specify its protection modes. If you don’t specify 
a partition’s protection modes when you create it, they are set to the same values as those of the 
parent partition. If you are the owner of a partition or the system administrator, you can use chpart 
to change an existing partition’s protection modes. 


Scheduling Characteristics 

Each partition has a set of scheduling characteristics that determine how the applications running in 
the partition are scheduled (that is, how the system arbitrates between processes when there are 
several processes running on a single node). 

You can specify a partition’s scheduling characteristics when you create it with mkpart and change 
them with chpart. If you don’t specify a partition’s scheduling characteristics when you create it, 
they are set to the same values as those of the parent partition. 

A partition uses one of three different forms of scheduling: standard scheduling, gang scheduling, 
or space sharing. 

• Partitions that use standard scheduling use the standard OSF/1 scheduling mechanisms. This 
gives good response to user input, but may result in poor performance for parallel applications 
(when one process in the application becomes inactive, other processes that depend on that 
process for information have to wait until it becomes active again). 

• Partitions that use gang scheduling use a modified scheduling mechanism that makes all the 
processes in a parallel application active at the same time. Also, where standard scheduling 
swaps processes in and out frequently (typically every 100 milliseconds), gang scheduling 
swaps applications in and out on the basis of the partition’s rollin quantum: a time period that 
can be up to 24 hours long. A long rollin quantum gives good performance for parallel 
applications, because the application can run for a long time without being interrupted, but may 
result in poor response to user input (when you give input to an application that is rolled out, the 
application does not respond until it is rolled in again). 

• Partitions that use space sharing allow only one application per node. When you run an 
application in a space-shared partition, the partition checks to see if another application or 
partition is already using the requested nodes. If any of the nodes are in use, your application 
fails immediately with the error message “request overlaps with nodes in use.” However, if all 
the specified nodes are available, your application begins running immediately and continues 
running, without interruption, until it completes. 

Standard-scheduled partitions should be used to run interactive applications and applications that are 
being debugged; gang-scheduled and space-shared partitions should be used to run non-interactive 
(typically either computationally-intensive or I/O-intensive) applications. 

The following sections give you more information about these three forms of scheduling. 


Standard Scheduling 

Standard scheduling is the same as the scheduling technique used on single-processor OSF/1 
systems. Standard scheduling is always used in the service partition. 

In a partition that uses standard scheduling, each node is scheduled like a separate computer; there 
is no attempt to coordinate related processes running on separate processors. 


NOTE 

A partition that uses standard scheduling may not contain 
subpartitions, and may not overlap any other partitions that use 
standard scheduling. 

In a partition that uses standard scheduling, each process has a priority, a number from -20 (high 
priority) to 20 (low priority), that is used in determining how much processor time the process gets. 

Partitions that use standard scheduling give good interactive performance for each individual 
process in the partition. However, there is no guarantee that related processes are active at the same 
time. This means that a process in a parallel application running in such a partition may find itself 
waiting for a message from a process that is not active, which reduces the performance of the 
application. To avoid this problem, you can use gang scheduling. 


Gang Scheduling 

Gang scheduling is a special scheduling technique that coordinates the scheduling of related 
processes running on separate processors. Gang scheduling is typically used only in the compute 
partition, or is not used at all (this is determined by your system administrator). 

In a partition that uses gang scheduling, the nodes are scheduled so that all the processes in an 
application are active at the same time. If there are multiple processes per node in the active 
application, standard scheduling is used to schedule these processes against each other while the 
application is active. 

Partitions that use gang scheduling may contain subpartitions, and may overlap other partitions of 
any type. 

In a partition that uses gang scheduling, not only does each process have a priority, but there is a 
separate priority for the application as a whole. An application’s priority is a number from 0 (low 
priority) to 10 (high priority). A gang-scheduled partition also has a priority of its own, as well as 
two other quantities called the effective priority limit and the rollin quantum: 

• A partition’s priority is the lower of the following: 

The priority of the highest-priority application or subpartition in the partition. 

The partition’s effective priority limit. 

• A partition’s effective priority limit is a number from 0 to 10 that places an upper limit on the 
partition’s priority. It does not affect the priorities of applications or partitions within the 
partition. 

• A partition’s rollin quantum is the amount of time each application in the partition is allowed to 
be active before the system considers running another application instead. The term “rollin 
quantum” comes from the application being “rolled in” when it is made active, and “rolled out” 
when it is made inactive. 

A gang-scheduled partition’s effective priority limit and rollin quantum are set when the partition is 
created, and do not vary unless you change them with the chpart command. A gang-scheduled 
partition’s priority may vary over time, depending on the priorities of the applications and 
subpartitions in the partition. 

A partition that uses standard scheduling does not have an effective priority limit or rollin quantum. 
It also does not have a numeric priority; instead, its priority is “infinite” (that is, higher than the 
priority of any gang-scheduled partition or application). 
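
The priority rules above can be summarized in a small Python sketch. The function name and the scheduling labels are hypothetical, and treating an empty partition as having priority 0 is an assumption that the text does not confirm.

```python
import math


def partition_priority(scheduling, member_priorities, effective_priority_limit=None):
    """Return a partition's priority.

    scheduling is "standard" or "gang" (labels chosen for this sketch).
    member_priorities holds the priorities (0 to 10) of the applications
    and subpartitions in the partition.
    """
    if scheduling == "standard":
        # Higher than any gang-scheduled partition or application.
        return math.inf
    # Assumption: a partition with no members is treated as priority 0.
    highest = max(member_priorities, default=0)
    # The effective priority limit caps the partition's own priority only;
    # it does not change the priorities of its members.
    return min(highest, effective_priority_limit)
```

For example, a gang-scheduled partition with an effective priority limit of 5 that contains applications of priority 3 and 7 has priority 5; if the priority-7 application finishes, the partition's priority drops to 3.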

Gang scheduling is performed recursively, partition by partition. For each gang-scheduled partition 
in the system, starting with the root partition, the operating system examines all the entities 
(applications and partitions) within the partition: 

1. Entities that do not overlap other entities (that is, they have no nodes in common with any other 
entity in the partition) are simply scheduled to run for the partition’s rollin quantum. 

2. Where two or more entities overlap, the priorities of the overlapping entities are compared, and 
the highest-priority entity is scheduled to run for the partition’s rollin quantum. 

3. If two or more entities overlap and are tied for highest priority, they are scheduled in a 
round-robin fashion (each takes turns running for one full rollin quantum). 

4. If an entity that is scheduled to run is a partition, the operating system examines and schedules 
the entities in the partition as described above. This process continues recursively as necessary. 

At the end of each partition’s rollin quantum, the operating system examines and schedules the 
entities in the partition again. 

Note that rules 2 and 3 mean that, when applications or partitions overlap, the one with the highest 
priority gets one rollin quantum after another until it completes. Entities with lower priorities get no 
processor time at all until the higher-priority entity has completed. If there is a tie for highest priority, 
the tied high-priority entities take turns running, but entities with lower priority get no processor time 
until all the high-priority entities complete. Partitions that use standard scheduling always have the 
highest priority, so if a standard-scheduled partition overlaps a gang-scheduled partition or an 
application, the standard-scheduled partition always wins. 
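
Rules 2 and 3 can be illustrated with a simplified Python sketch. It considers only a group of entities that all overlap one another within a single partition, and it takes the round-robin order to be alphabetical by name; both simplifications are assumptions made for illustration, and the function name is hypothetical.

```python
def runner_for_quantum(priorities, quantum_index):
    """Pick which of several mutually overlapping entities runs in a quantum.

    priorities maps an entity name to its priority. The highest-priority
    entity runs; entities tied for highest take turns, one full rollin
    quantum each (alphabetical order is assumed here).
    """
    top = max(priorities.values())
    tied = sorted(name for name, p in priorities.items() if p == top)
    return tied[quantum_index % len(tied)]


entities = {"app_a": 5, "app_b": 5, "app_c": 1}
print([runner_for_quantum(entities, q) for q in range(4)])
# ['app_a', 'app_b', 'app_a', 'app_b']
```

In this sketch app_c never runs while app_a and app_b are present, which is the behavior described above for lower-priority entities.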

NOTE 

Use of gang scheduling may be limited by the policies of your site. 


Your system administrator can require all compute partitions to use space sharing. If gang 
scheduling is allowed, the administrator can restrict the number of gang-scheduled partitions in the 
system, can set a minimum rollin quantum, and can restrict the number of applications that can 
overlap in each gang-scheduled partition. If you try to create a partition that would exceed these 
restrictions, you see an error message such as “exceeded allocator configuration parameters” or 
“scheduling parameters conflict with allocator configuration.” See your system administrator for 
information on the policies in force at your site. 


Space Sharing 

Space sharing, also referred to as tiling, is a scheduling technique that prevents partitions and 
applications from overlapping. (Overlapping means having any physical nodes in common.) Space 
sharing is typically used in all partitions other than the service and compute partitions. If your system 
administrator has disallowed gang scheduling, space sharing is used in all partitions other than the 
service partition. Within a space-shared partition: 

• Subpartitions may not overlap other subpartitions. 

• Applications may not overlap other applications. 

• Active subpartitions may not overlap applications. 

An active subpartition is a subpartition in which one or more applications are running. 


NOTE 

If an application is running anywhere in a subpartition or any of its 
sub-subpartitions—even on a single node—the entire subpartition 
is considered active, and is not allowed to overlap with a running 
application. 


If a subpartition is not active (contains no running applications), it can overlap a running application, 
but it cannot overlap another partition. 

In a space-shared partition, any attempt to create a partition or run an application that would cause 
an overlap fails immediately. However, once an application is successfully started, it continues 
running without interruption until it completes. (Exception: if a space-shared partition overlaps with 
another partition, the entire partition can be interrupted by applications running in that other 
partition. This can only occur if the space-shared partition’s parent is a gang-scheduled partition.) 
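
The overlap check for applications can be sketched in Python as follows. This is a simplified model that tracks running applications only and ignores subpartitions; the function name is hypothetical.

```python
def try_start(requested, running):
    """Try to start an application on the requested nodes.

    running is a list of node sets, one per running application. The
    request fails immediately if it shares any node with a running
    application; otherwise it is added to running.
    """
    requested = set(requested)
    for nodes in running:
        if requested & nodes:
            return False, "request overlaps with nodes in use"
    running.append(requested)
    return True, ""


running = [set(range(5))]                    # one application on nodes 0 through 4
print(try_start(range(4, 10), running))      # (False, 'request overlaps with nodes in use')
print(try_start(range(5, 10), running))      # (True, '')
```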

Space sharing is the opposite of the “time sharing” used in standard scheduling and gang scheduling. 
In time sharing, multiple applications can use the same nodes at the same time, but each application 
gets only a fraction of its nodes’ processor time. In space sharing, no two applications can use a node 
at the same time, but each application gets 100% of its nodes’ processor time. 

Although space sharing allows only one application per node, you can have more than one process 
per node within a single application. If there are multiple processes per node within an application, 
standard scheduling is used to schedule these processes against each other on each node. 

Partitions that use space sharing may contain subpartitions, which cannot overlap. The space-shared 
partition itself can overlap another partition of any type, but the advantages of space sharing may be 
lost if space-shared partitions overlap with other partitions. 

Like gang-scheduled partitions, space-shared partitions have a priority and an effective priority 
limit. Each application within a space-shared partition has a priority from 0 to 10, and the partition’s 
priority is the lesser of the effective priority limit and the highest application priority in the partition. 
Since applications in space-shared partitions never overlap, their priorities are never compared 
with each other. However, the priorities of applications in a space-shared partition are important 
because they determine the partition’s priority when compared with other partitions at its own 
hierarchical level. 

Unlike gang-scheduled partitions, space-shared partitions do not have a rollin quantum (since 
applications never overlap, they never have to be rolled in or out). In effect, the rollin quantum of a 
space-shared partition is “infinite.” 

Summary of Scheduling Types 

Table 2-1 summarizes the differences between the three scheduling types. 


Table 2-1. Summary of Scheduling Types 

Characteristic             Standard Scheduling        Gang Scheduling            Space Sharing
-------------------------  -------------------------  -------------------------  --------------------------
Scheduling method used     Each process is scheduled  All processes in an        All processes in an
within partition           by itself using standard   application run at the     application run at the
                           UNIX techniques            same time; applications    same time; each
                                                      may be rolled in and out   application runs until it
                                                                                 completes

Partitions that typically  Service partition          Compute partition, or      All other partitions
use this scheduling type                              none at all

Restrictions on partition  Partition may not overlap  Partition may overlap      Partition may overlap
overlap                    other standard-scheduled   other partitions           other partitions (but
                           partitions                                            overlap can lose benefits
                                                                                 of space sharing)

Restrictions on            Subpartitions are not      Subpartitions may          Subpartitions may not
subpartition overlap       allowed                    overlap; maximum depth     overlap other
                                                      of overlap can be          subpartitions; active
                                                      restricted by system       subpartitions may not
                                                      administrator              overlap applications

Restrictions on            Applications may overlap   Applications may overlap;  Applications may not
application overlap        to any depth               maximum depth of           overlap other applications
                                                      overlap can be restricted  or active subpartitions
                                                      by system administrator

Special partition          Partition priority         Partition priority,        Partition priority,
characteristics            (always “infinite”)        effective priority limit,  effective priority limit
                                                      rollin quantum


A Scheduling Example 

Suppose that a partition has 10 nodes, and an application is currently running on 5 of those nodes. If 
you attempt to run a new application on 6 nodes of that partition, the results depend on the partition’s 
scheduling type: 

• If the partition uses standard scheduling, both applications run at once. Where the applications 
overlap, the two applications’ processes time-share the node. No attempt is made to coordinate 
when the processes are active with the rest of the application. 

• If the partition uses gang scheduling, the two applications’ priorities are compared: 

If the new application’s priority is greater than the old application’s, the entire old 
application is immediately rolled out and the new application starts running. The new 
application runs until it finishes, then the old application is rolled back in. 

If the new application’s priority is less than the old application’s, the entire new application 
waits until the old application finishes. (During this time it may appear to be “hung.”) When 
the old application finishes, the new application is rolled in and runs until it finishes. 

If the two applications’ priorities are equal, the applications alternate running on each rollin 
quantum. If one application finishes first, the other runs in every rollin quantum until it 
finishes. 

• If the partition uses space sharing, the new application fails with the error message “request 
overlaps with nodes in use” and does not run. 

You can use the pspart command to determine which applications are currently running in a 
partition and what their priorities are, and you can use the command showpart -f to determine which 
nodes in a partition have applications running on them. 
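
The arithmetic behind this example, and the three possible outcomes, can be sketched in Python. With 5 of 10 nodes in use, a 6-node request must share at least one node with the running application. The function names, scheduling labels, and outcome labels are hypothetical.

```python
def min_overlap(partition_size, nodes_in_use, nodes_requested):
    """Smallest number of nodes two applications must share in a partition."""
    return max(0, nodes_in_use + nodes_requested - partition_size)


def describe_outcome(scheduling, old_priority, new_priority):
    """Outcome when a new application overlaps one that is already running."""
    if scheduling == "standard":
        return "time-share"            # both run; shared nodes are time-shared
    if scheduling == "space":
        return "new-fails"             # "request overlaps with nodes in use"
    if scheduling == "gang":
        if new_priority > old_priority:
            return "new-preempts-old"  # old is rolled out until new finishes
        if new_priority < old_priority:
            return "new-waits"         # new may appear hung until old finishes
        return "alternate"             # one rollin quantum each, in turn
    raise ValueError(f"unknown scheduling type {scheduling!r}")


print(min_overlap(10, 5, 6))            # 1
print(describe_outcome("gang", 5, 5))   # alternate
```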


Making Partitions 


Command Synopsis                                        Description 

mkpart [ -sz size | -sz hXw | -nd nodespec ]            Create a partition. 
       [ -ss | [[ -sps | -rq time ] [ -epl priority ] ] ] 
       [ -mod mode ] name 


To create a partition, use the mkpart command. You can specify either a relative or an absolute 
partition pathname for the new partition. The specified new partition must not exist; the parent 
partition of the new partition must exist and must grant you write permission. 

For example, to create a partition called mypart whose parent partition is the compute partition, you 
can use the following command: 

% mkpart mypart 

The following command has the same effect, but uses an absolute partition pathname: 

% mkpart .compute.mypart 


Specifying the Nodes Allocated to the Partition 


The mkpart command gives you three ways to specify which nodes are allocated to the new 
partition: 

-sz size Creates a partition whose size (number of nodes) is size. The -sz size switch 

attempts to create a square partition if it caa If this is not possible, it attempts 
to create a rectangular partition that is either twice as .wide as it is high or 
twice as high as it is wide. If this is not possible, it uses any available nodes. 
In this case, the nodes allocated to the partition may not be contiguous. 

-sz hXw Creates a contiguous rectangular partition that is h nodes high and w nodes 

wide. (You can use an uppercase or lowercase letter X between the integers h 
and w.) 


-nd nodespec    Creates a partition that consists of exactly the specified nodes, where
                nodespec is one of the following:

                x                   The node whose node number is x.

                x..y                The range of nodes from numbers x to y.

                hXw:n               The rectangular group of nodes that is h nodes high
                                    and w nodes wide and whose upper left corner is node
                                    number n. (You can use an uppercase or lowercase
                                    letter X between the integers h and w.)

                nspec[,nspec]...    The specified list of nodes, where each nspec is a node
                                    specifier of the form x, x..y, or hXw:n (no node may
                                    appear more than once in this list). Do not put any
                                    spaces in this list.

                The numbers you use with -nd are node numbers within the parent partition,
                which always range from 0 to one less than the size of the partition.
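
The shape preference described for -sz size can be sketched as follows. This is only an illustration of the rule as stated above, not the allocator itself: the function name preferred_shapes is hypothetical, the order in which the two 2:1 rectangles are tried is an assumption, and the sketch does not check whether a shape actually fits in the free nodes of the parent partition.

```python
import math

def preferred_shapes(size):
    """Return the (height, width) shapes that -sz size is described as
    preferring, before it falls back to any available nodes.
    Hypothetical helper; the real placement logic is not modeled."""
    shapes = []
    root = math.isqrt(size)
    if root * root == size:
        shapes.append((root, root))      # a square partition
    if size % 2 == 0:
        k = math.isqrt(size // 2)
        if k * k == size // 2:
            shapes.append((k, 2 * k))    # twice as wide as it is high
            shapes.append((2 * k, k))    # twice as high as it is wide
    return shapes

print(preferred_shapes(16))   # [(4, 4)]
print(preferred_shapes(50))   # [(5, 10), (10, 5)]
print(preferred_shapes(7))    # [] (falls back to any available nodes)
```

An empty list corresponds to the last case in the description, where the nodes allocated to the partition may not be contiguous.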

If you don’t use the -sz or -nd switch, all the nodes of the parent partition are allocated to the new
partition. You can use at most one -sz or -nd switch in a single mkpart command.

The following examples all create a 50-node partition called mypart whose parent partition is the
compute partition (that is, the new partition’s absolute partition pathname is .compute.mypart):

• This command creates a 50-node partition with no specified shape or location: 

% mkpart -sz 50 mypart

The nodes of the new partition are selected from the parent partition by the system, and they 
may not be contiguous. 

• This command creates a partition 10 nodes high and 5 nodes wide: 

% mkpart -sz 10x5 mypart

The position of the new partition within the parent partition is selected by the system, but the 
new partition is a contiguous rectangle. 

• This command creates a partition 10 nodes high and 5 nodes wide located in the upper left
corner of the parent partition:

% mkpart -nd 10X5:0 mypart

The shape and position of the new partition are specified by the user, and the new partition is a 
contiguous rectangle. 

• This command creates a partition that consists of nodes 30 through 79 of the parent partition: 

% mkpart -nd 30..79 mypart

The specific nodes of the partition are specified by the user, and the new partition may or may 
not be contiguous (its shape depends on the size and shape of the compute partition). 

• This command creates a partition that consists of node 0, nodes 3 through 16, and a 5 by 7 node 
rectangle located at node 21 of the parent partition: 

% mkpart -nd 0,3..16,5X7:21 mypart

The specific nodes of the partition are specified by the user, and the new partition is not
contiguous (its shape depends on the size and shape of the compute partition).

No matter how you specify the partition’s size, nodes are always numbered from 0 to one less than
the partition’s size. In most cases, they are numbered from left to right and then top to bottom, as
discussed under “Nodes Allocated to the Partition” on page 2-30. However, if you use the -nd
switch, the nodes in the new partition are numbered in the order you specified them in the -nd switch.
For example, the following command creates a partition that consists of nodes 30 through 79 of the
compute partition, listed in descending order:

% mkpart -nd 79..30 mypart 

In this case, node 79 of the parent partition is node 0 of the new partition; node 78 of the parent 
partition is node 1 of the new partition; and so on to node 30 of the parent partition, which is node 
49 of the new partition.
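
The numbering rule for -nd can be sketched as a small expansion routine. This is a minimal sketch under stated assumptions: expand_nodespec is a hypothetical name, the parent partition is assumed to be a rectangle whose width you pass in (16 below is an arbitrary choice), and the row-by-row order used inside an hXw:n rectangle is an assumption. The sketch does not check that a rectangle fits inside the parent.

```python
import re

def expand_nodespec(nodespec, parent_width):
    """Expand an -nd nodespec into parent node numbers, in the order given.
    The position of each entry in the result is its node number in the
    new partition. Hypothetical helper, not part of mkpart."""
    nodes = []
    for item in nodespec.split(","):
        rect = re.fullmatch(r"(\d+)[xX](\d+):(\d+)", item)
        rng = re.fullmatch(r"(\d+)\.\.(\d+)", item)
        if rect:
            height, width, corner = map(int, rect.groups())
            for row in range(height):
                for col in range(width):
                    nodes.append(corner + row * parent_width + col)
        elif rng:
            first, last = map(int, rng.groups())
            step = 1 if last >= first else -1
            nodes.extend(range(first, last + step, step))
        elif item.isdigit():
            nodes.append(int(item))
        else:
            raise ValueError(f"bad node specifier: {item!r}")
    if len(set(nodes)) != len(nodes):
        raise ValueError("a node appears more than once in the list")
    return nodes

order = expand_nodespec("79..30", parent_width=16)
print(order[:3])         # [79, 78, 77]
print(order.index(30))   # 49
print(len(expand_nodespec("0,3..16,5X7:21", parent_width=16)))   # 50
```

The index of a parent node number in the returned list is its node number in the new partition, which matches the 79..30 example: parent node 79 becomes node 0 and parent node 30 becomes node 49.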


Specifying Protection Modes 

The mkpart command gives you two ways to specify the protection modes of the new partition: 

-mod nnn        Creates a partition whose protection modes are specified by the three-digit
                octal number nnn, as used by the chmod command (see chmod in the OSF/1
                Command Reference for more information).

-mod string     Creates a partition whose protection modes are specified by the
                nine-character string string. The string must have the form rwxrwxrwx,
                where a letter (r, w, or x) represents a permission granted and a dash (-)
                represents a permission denied, as displayed by the command ls -l (see ls in
                the OSF/1 Command Reference for more information).

You can use at most one -mod switch in a single mkpart command. If you don’t use the -mod 
switch, the new partition is given the same protection modes as its parent partition. 
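
The correspondence between the two forms of mode can be sketched as follows. The function names are hypothetical and the code only illustrates the encoding; it is not how mkpart itself parses the switch.

```python
def mode_string_to_octal(mode):
    """Convert a nine-character string such as rwxr-xr-- to three octal digits."""
    if len(mode) != 9:
        raise ValueError("mode string must have nine characters")
    value = 0
    for position, char in enumerate(mode):
        if char == "rwx"[position % 3]:
            bit = 1                      # permission granted
        elif char == "-":
            bit = 0                      # permission denied
        else:
            raise ValueError(f"unexpected character {char!r} at position {position}")
        value = (value << 1) | bit
    return format(value, "03o")

def octal_to_mode_string(octal):
    """Convert three octal digits such as 754 to the nine-character form."""
    value = int(octal, 8)
    return "".join(
        "rwx"[i % 3] if value & (1 << (8 - i)) else "-" for i in range(9)
    )

print(mode_string_to_octal("rwxr-xr--"))   # 754
print(octal_to_mode_string("700"))         # rwx------
```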

For example, the following command creates a partition that is readable, writable, and executable by 
you; readable and executable by your group, and only readable by others: 

% mkpart -mod rwxr-xr-- mypart

The following command has the same effect, but uses an octal number:

% mkpart -mod 754 mypart


Specifying Scheduling Characteristics

The mkpart command gives you three switches to specify the scheduling characteristics of the new
partition:

-ss             Creates a partition that uses standard scheduling.

                -ss cannot be used with -sps, -rq, or -epl.

-rq time        Creates a partition that uses gang scheduling with a rollin quantum of time,
                where time is one of the following:

                n       n milliseconds (if n is not a multiple of 100, it is
                        silently rounded up to the next multiple of 100).

                ns      n seconds.

                nm      n minutes.

                nh      n hours.

                0       “Infinite” time: once rolled in, an application runs until
                        it exits.

                The maximum rollin quantum is 24 hours; the minimum rollin quantum for
                your system is determined by your system administrator.

                -rq cannot be used with -ss or -sps. -rq can be used with or without -epl; if
                you use -rq without -epl, the new partition is a gang-scheduled partition with
                the same effective priority limit as its parent partition.

                If gang-scheduled partitions are not allowed at your site, or creating a
                gang-scheduled partition would exceed the maximum number of
                gang-scheduled partitions, any attempt to create a partition with -rq fails.

-sps            Creates a partition that uses space sharing.

                -sps cannot be used with -ss or -rq. -sps can be used with or without -epl; if
                you use -sps without -epl, the new partition is a space-shared partition with
                the same effective priority limit as its parent partition.




-epl priority Creates a partition with an effective priority limit of priority, where priority 
is an integer from 0 to 10 inclusive (0 is low priority, 10 is high priority). 

-epl cannot be used together with -ss. If you use -epl without either -sps or 
-rq, the results depend on the scheduling type of the parent partition: 

• If the parent partition is a space-shared partition, the new partition is a 
space-shared partition with the specified effective priority limit. 

• If the parent partition is a gang-scheduled partition, the new partition is 
a gang-scheduled partition with the specified effective priority limit and 
the same rollin quantum as its parent. If this would exceed the maximum 
number of gang-scheduled partitions, the new partition is a space-shared 
partition instead.

If you don’t use the -ss, -rq, or -sps switch, the new partition uses the same scheduling technique,
rollin quantum, and effective priority limit as its parent partition.
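
The time formats accepted by -rq and the switch combinations that are ruled out can be sketched as follows. Both function names are hypothetical, the site-specific minimum quantum is not modeled, and treating a zero value with a unit letter (such as 0s) as infinite is an assumption.

```python
MAX_QUANTUM_MS = 24 * 60 * 60 * 1000   # the 24-hour maximum

def parse_rollin_quantum(text):
    """Return the rollin quantum in milliseconds, or None for infinite time."""
    units = {"s": 1000, "m": 60 * 1000, "h": 60 * 60 * 1000}
    if text and text[-1] in units:
        number, scale = text[:-1], units[text[-1]]
    else:
        number, scale = text, 1          # no letter means milliseconds
    if not number.isdigit():
        raise ValueError(f"bad rollin quantum: {text!r}")
    ms = int(number) * scale
    if ms == 0:
        return None                      # 0 means infinite time
    ms = -(-ms // 100) * 100             # round up to a multiple of 100
    if ms > MAX_QUANTUM_MS:
        raise ValueError("rollin quantum exceeds 24 hours")
    return ms

def check_scheduling_switches(switches):
    """Raise ValueError for combinations the text says cannot be used together."""
    used = set(switches)
    if "-ss" in used and used & {"-sps", "-rq", "-epl"}:
        raise ValueError("-ss cannot be used with -sps, -rq, or -epl")
    if "-rq" in used and "-sps" in used:
        raise ValueError("-rq cannot be used with -sps")

print(parse_rollin_quantum("250"))   # 300
print(parse_rollin_quantum("10s"))   # 10000
print(parse_rollin_quantum("0"))     # None
check_scheduling_switches(["-rq", "-epl"])   # allowed, so no exception
```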

For example, the following command creates a partition that uses standard scheduling: 

% mkpart -ss mypart 

The following command creates a partition that uses gang scheduling with a rollin quantum of 10 
seconds and the same effective priority limit as its parent partition: 

% mkpart -rq 10s mypart 

The following command creates a partition that uses space sharing with the same effective priority
limit as its parent partition:

% mkpart -sps mypart 

The following command creates a partition that uses gang scheduling with a rollin quantum of 5 
minutes and an effective priority limit of 6: 

% mkpart -rq 5m -epl 6 mypart


Removing Partitions

Command Synopsis                    Description

rmpart [ -f ] [ -r ] partition      Remove a partition.


To remove an existing partition, use the rmpart command. You must have write permission on the 
parent partition of the partition to be removed. You can specify the partition to be removed with 
either a relative or an absolute partition pathname. 

For example, to remove the partition called mypart, whose parent partition is the compute partition, 
you can use the following command: 

% rmpart mypart 

The following command has the same effect, but uses an absolute partition pathname: 

% rmpart .compute.mypart 


Removing Partitions Containing Running Applications 

If you specify a partition that contains any running applications, you see an error message and the 
partition is not removed. You can force rmpart to remove a partition that contains running
applications with the -f switch. When you use the -f switch, rmpart terminates all the applications 
running in the specified partition and then removes it. 

For example, if there are applications running in mypart, use the following command to terminate 
the applications and remove the partition: 

% rmpart -f mypart 


Removing Partitions Containing Subpartitions 

If you specify a partition that contains any subpartitions, you see an error message and the partition 
is not removed. You can force rmpart to remove a partition that contains subpartitions with the -r 
switch. When you use the -r switch, rmpart recursively removes all the subpartitions in the 
specified partition (and their sub-subpartitions, and so on) and then removes the specified partition.

For example, if there are subpartitions in mypart, use the following command to remove mypart and
all its subpartitions:

% rmpart -r mypart

rmpart -r is an “all or nothing” operation. If any subpartitions cannot be removed, the command
fails and no subpartitions are removed.

The -r switch does not imply -f. If mypart or any of its subpartitions contains any running
applications, you see an error message and none of the partitions are removed. You can force rmpart
to remove a partition that contains subpartitions and running applications by using the -r and -f
switches together. When you use both these switches, rmpart terminates all the applications running
in the specified partition and its subpartitions, removes all the subpartitions in the specified partition,
and then removes the specified partition.
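
The way -f and -r interact can be sketched with a small in-memory model. The Partition class and the rmpart function below are hypothetical stand-ins, not the system's data structures; permission checks are left out, and terminating applications is modeled simply by discarding the partition tree.

```python
class Partition:
    """Toy partition: a name, a count of running applications, and subpartitions."""
    def __init__(self, name, apps=0):
        self.name = name
        self.apps = apps
        self.subs = {}

    def add(self, child):
        self.subs[child.name] = child
        return child

def has_running_apps(part):
    """True if the partition or any partition below it runs an application."""
    return part.apps > 0 or any(has_running_apps(s) for s in part.subs.values())

def rmpart(parent, name, force=False, recursive=False):
    """Remove parent.subs[name], following the -f and -r rules described above."""
    target = parent.subs[name]
    if target.subs and not recursive:
        raise RuntimeError("partition contains subpartitions (use -r)")
    if has_running_apps(target) and not force:
        raise RuntimeError("partition contains running applications (use -f)")
    # Every check has passed before anything changes, so the removal is
    # all or nothing: the whole subtree goes, or none of it does.
    del parent.subs[name]

compute = Partition("compute")
mypart = compute.add(Partition("mypart"))
mypart.add(Partition("sub", apps=1))

try:
    rmpart(compute, "mypart", recursive=True)      # -r alone does not imply -f
except RuntimeError as err:
    print(err)
print("mypart" in compute.subs)                    # True (nothing was removed)

rmpart(compute, "mypart", force=True, recursive=True)
print("mypart" in compute.subs)                    # False
```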


Showing Partition Characteristics 

Command Synopsis Description 

showpart [ -f ] [ partition ] Show the characteristics of a partition. 


To show the characteristics of a partition, use the showpart command. You can specify the partition 
with either a relative or an absolute partition pathname. If you don’t specify a partition, showpart 
shows the characteristics of your default partition (see “Using the Default Partition” on page 2-14). 
In either case, you must have read permission on the specified partition. 

For example, to show the characteristics of the partition called mypart , whose parent partition is the 
compute partition, you can use the following command: 

% showpart mypart 

USER       GROUP      ACCESS   SIZE   FREE   RQ     EPL
smith      eng        777      9      5      15m    5
  +---------+
 0| . . . . |
 4| . * * * |
 8| . * * * |
12| . * * * |
  +---------+

The following command has the same effect, but uses an absolute partition pathname: 

% showpart .compute.mypart

The columns at the top of the showpart output have the following meanings:

USER            The owner of the partition, in this case smith.

GROUP           The group of the partition, in this case eng.

ACCESS          The access permissions, expressed as an octal number, in this case 777 (which
                represents the permissions rwxrwxrwx).

SIZE            The number of nodes in the partition, in this case 9.

FREE            The number of free nodes in the partition, in this case 5 (see “Showing Free
                Nodes” on page 2-48 for more information on free nodes).

RQ              The rollin quantum or scheduling type of the partition, as follows:

                -       The partition uses standard scheduling.

                SPS     The partition uses space sharing.

                time    The partition uses gang scheduling with a rollin
                        quantum of time. The time is expressed as a number
                        followed by an optional letter: no letter for
                        milliseconds, s for seconds, m for minutes, or h for
                        hours.

                In this case, the partition is a gang-scheduled partition with a rollin quantum
                of 15 minutes.

EPL             The effective priority limit of the partition, in this case 5, or a dash (-) for a
                standard-scheduled partition.

See “Partition Characteristics” on page 2-29 for information on these partition characteristics. 

The rectangular picture at the bottom of the showpart output shows the size, shape, and position of 
the specified partition within the system: 

• The large rectangle represents the root partition. In this case, the root partition is 4 nodes high 
and 4 nodes wide. 

• The numbers to the left of the rectangle show the node numbers of the nodes in the first column 
of each row. In this case, the first node in the top row is node 0, the first node in the second row 
is node 4, the first node in the third row is node 8, and the first node in the bottom row is node 12.

• Asterisks (*) within the rectangle represent nodes that are allocated to the specified partition;
periods (.) represent other nodes. In this case, mypart consists of nodes 5-7, 9-11, and 13-15
of the root partition.

• If you see a dash (-) or an x within the rectangle, it represents an unusable node that is allocated 
to the specified partition. You cannot run any applications or allocate any partitions using this 
node. See “Unusable Nodes” on page 2-31 for more information. 
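
Reading the picture can be sketched as follows. This is an illustration only: allocated_nodes is a hypothetical helper, and it assumes each row has already been split into the row's first node number and the space-separated cells between the vertical bars.

```python
def allocated_nodes(picture_rows, marks=("*",)):
    """Return the root-partition node numbers whose cell is one of marks.

    picture_rows is a list of (first_node_number, cells) pairs, where cells
    is the text between the vertical bars of one row of the picture.
    """
    nodes = []
    for first_node, cells in picture_rows:
        for column, cell in enumerate(cells.split()):
            if cell in marks:
                nodes.append(first_node + column)
    return nodes

picture = [
    (0,  ". . . ."),
    (4,  ". * * *"),
    (8,  ". * * *"),
    (12, ". * * *"),
]
print(allocated_nodes(picture))   # [5, 6, 7, 9, 10, 11, 13, 14, 15]
```

Passing marks=("-", "x") instead would list the unusable nodes in the same picture.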


Showing Free Nodes 

The output of lspart or showpart includes the number of free nodes in the FREE column. A node
is free if no application is running on that node and no subpartition in which any applications are 
running includes that node. (Note that all the nodes of a subpartition are considered busy if an 
application is running anywhere in the subpartition, or in any of its sub-subpartitions. This occurs 
because partitions are scheduled recursively.) 

You can use the -f switch of showpart to see which nodes are free. The output of showpart -f is the 
same as the regular showpart output, except that free nodes are shown as an F instead of an asterisk. 
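
The definition of a free node can be sketched as follows. The Part class is a hypothetical model, and for simplicity the nodes of each subpartition are written in the numbering of the partition being examined, which is an assumption of this sketch.

```python
class Part:
    """Toy partition: its nodes, the nodes running applications directly
    in it, and its subpartitions."""
    def __init__(self, nodes, app_nodes=(), subs=()):
        self.nodes = set(nodes)
        self.app_nodes = set(app_nodes)
        self.subs = list(subs)

def is_active(part):
    """True if an application runs in the partition or anywhere below it."""
    return bool(part.app_nodes) or any(is_active(s) for s in part.subs)

def free_nodes(part):
    """Nodes with no application and outside every active subpartition."""
    busy = set(part.app_nodes)
    for sub in part.subs:
        if is_active(sub):
            busy |= sub.nodes    # every node of an active subpartition is busy
    return sorted(part.nodes - busy)

# A 9-node partition: one application on node 0, and a subpartition on
# nodes 1-3 whose only application runs on node 1.
sub = Part(nodes=[1, 2, 3], app_nodes=[1])
mypart = Part(nodes=range(9), app_nodes=[0], subs=[sub])
print(free_nodes(mypart))   # [4, 5, 6, 7, 8]
```

Nodes 2 and 3 are counted as busy even though no application runs on them, because they belong to a subpartition that contains a running application.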

For example, to show the free nodes in the partition called mypart, whose parent partition is the 
compute partition, you can use the following command: 

% showpart -f mypart
USER       GROUP      ACCESS   SIZE   FREE   RQ     EPL
smith      eng        777      9      5      15m    5
  +---------+
 0| . . . . |
 4| . * * * |
 8| . * F F |
12| . F F F |
  +---------+

In this case, mypart has five free nodes: nodes 4, 5, 6, 7, and 8 of the partition.


Listing Subpartitions

Command Synopsis                Description

lspart [ -r ] [ partition ]     List the subpartitions of a partition.


To list the subpartitions of a partition with their characteristics, use the lspart command. You can
specify the partition with either a relative or an absolute partition pathname. If you don’t specify a
partition, lspart lists the subpartitions of your default partition (see “Using the Default Partition” on
page 2-14). In either case, you must have read permission on the specified partition.

For example, to list the subpartitions of the partition called mypart, whose parent partition is the 
compute partition, you can use the following command: 

% lspart mypart
USER     GROUP    ACCESS   SIZE   FREE   RQ     EPL   PARTITION
chris    eng      777      16     4      15m    3     mandelbrot
chris    eng      777      16     16     -      -     debug
pat      mrkt     755      4      0      SPS    10    slalom
*        *        *        *      *      *      *     private


The following command has the same effect, but uses an absolute partition pathname: 

% lspart .compute.mypart

The columns in the output of lspart are the same as the top part of the output of showpart (see
“Showing Partition Characteristics” on page 2-46), with the addition of the partition name. In this
case, mypart has four subpartitions: mandelbrot, debug, slalom, and private.

• mandelbrot is owned by user chris in group eng; it has permissions rwxrwxrwx and a size of 
16 nodes, of which 4 are free (see “Showing Free Nodes” on page 2-48 for more information on 
free nodes). It is a gang-scheduled partition with a rollin quantum of 15 minutes and an effective 
priority limit of 3. 

• debug is also owned by user chris in group eng; it has permissions rwxrwxrwx and a size of
16 nodes, of which all 16 are free. It is a standard-scheduled partition, so it has no rollin quantum
or effective priority limit.

• slalom is owned by user pat in group mrkt; it has permissions rwxr-xr-x and a size of 4
nodes, of which none are free. It is a space-shared partition with an effective priority limit of 10.

• private’s access permissions do not grant you read permission, so all its characteristics are
shown as asterisks (*).

If you see two numbers separated by a slash in the SIZE column, it indicates that one or more of the
nodes allocated to the indicated partition is unusable. For example:

% lspart mypart
USER     GROUP    ACCESS   SIZE    FREE   RQ     EPL   PARTITION
chris    eng      777      14/16   10     15m    3     mandelbrot

This indicates that there are 16 nodes allocated to mandelbrot, but 2 of them are currently unusable.
You cannot run any applications or allocate any partitions using unusable nodes. See “Unusable
Nodes” on page 2-31 for more information.

Recursively Listing Subpartitions 

To recursively list all of a partition’s subpartitions, sub-subpartitions, and so on, use the -r switch.
For example:

% lspart -r mypart
USER     GROUP    ACCESS   SIZE   FREE   RQ     EPL   PARTITION
compute.mypart:
chris    eng      777      16     4      15m    3     mandelbrot
chris    eng      777      16     16     -      -     debug
pat      mrkt     755      4      0      SPS    10    slalom
*        *        *        *      *      *      *     private
compute.mypart.mandelbrot:
chris    eng      777      16     16     15m    10    hi_pri
chris    eng      777      16     16     15m    1     lo_pri


The lspart -r output reveals that mypart.mandelbrot has two subpartitions, hi_pri and lo_pri, neither
of which has any sub-subpartitions, and that slalom and debug have no subpartitions. No information 
is available on the subpartitions of private (if any), because private does not grant you read 
permission. 


NOTE 

If you specify a partition that has no subpartitions, lspart produces 
no output. 


For example, since mypart.slalom has no subpartitions, an lspart command on this partition gives 
no output: 

% lspart mypart.slalom 
% 

To get information about mypart.slalom itself, use the showpart command.


Listing the Applications in a Partition

Command Synopsis                Description

pspart [ -r ] [ partition ]     List the applications in a partition.


To list the applications in a partition, with information about the rollin/rollout status of each, use the 
pspart command. You can specify the partition with either a relative or an absolute partition 
pathname. If you don’t specify a partition, pspart lists the applications in your default partition (see 
“Using the Default Partition” on page 2-14). In either case, you must have read permission on the 
specified partition. 

For example, to list the applications in the partition mypart, whose parent partition is the compute 
partition, you can use the following command: 


% pspart mypart
 PGID   USER    SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   COMMAND
12345   pat      256     5   11:42:20     45.00   75%       0:04:41   mag -sz 256
23456   chris     67     4   Jan 21           -     -       0:12.30   boggle
34567   smith    192    10   02:21:51   0:01:00  100%       2:12:03   myfft


The following command has the same effect, but uses an absolute partition pathname: 


% pspart .compute.mypart 


The columns in the output of pspart have the following meanings:

PGID            The process group ID of the application (see “Process Groups” on page 4-22
                for more information).

                The process group ID of an application is always the same as the process ID
                of the application’s controlling process. This means that you can use this
                number with the kill command to kill the application; for example, given the
                pspart output above, the command kill 34567 would kill the application
                myfft.

USER            The login name of the user who invoked the application.

SIZE            The number of nodes allocated to the application from the partition (see
                “Specifying Application Size” on page 2-15 for more information).

PRI             The application’s priority (see “Specifying Application Priority” on page
                2-17 for more information).

START           The time the application was started. If the application was started more than
                24 hours ago, the date it was started is shown instead.

TIME ACTIVE     The amount of time the application has been active (rolled in) in the current
                rollin quantum (see “Gang Scheduling” on page 2-35 for more information).
                The time active is shown both as an absolute time (in the format
                minutes:seconds.milliseconds for times less than one minute or
                hours:minutes:seconds for times of one minute or more) and as a percentage
                of the partition’s rollin quantum. If the application is not active in the current
                rollin quantum, a dash (-) is shown for both quantities. If the partition uses
                space sharing, the time shown is the total amount of time the application has
                been running and the percentage is always 100%.

                In the example above, the partition mypart is a gang-scheduled partition with
                a rollin quantum of one minute. The application mag has been active for 45
                seconds, or 75% of the rollin quantum; the application boggle is not currently
                active; and the application myfft has been active for one minute, or 100% of
                the rollin quantum.

TOTAL TIME      The total amount of time the application has been rolled in since it was
                started, in the format minutes:seconds.milliseconds or
                hours:minutes:seconds. If the partition uses space sharing, the TOTAL
                TIME is always the same as the TIME ACTIVE.

                In the example above, the application mag has been active for a total of 4
                minutes and 41 seconds; the application boggle has been active for a total of
                12.30 seconds; and the application myfft has been active for a total of 2 hours,
                12 minutes, and 3 seconds.

COMMAND         The command line by which the application was invoked.
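
The relationship between the time fields and the percentage can be checked with a short sketch. The function names are hypothetical, and the parser simply treats each colon-separated field as a base-60 digit, which covers the values shown in the example output.

```python
def parse_pspart_time(text):
    """Convert a TIME ACTIVE or TOTAL TIME field to seconds (None for a dash)."""
    if text == "-":
        return None
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def percent_of_quantum(active_seconds, quantum_seconds):
    """Express the active time as a percentage of the rollin quantum."""
    return round(100 * active_seconds / quantum_seconds)

quantum = 60   # the one-minute rollin quantum of the example
print(percent_of_quantum(parse_pspart_time("45.00"), quantum))     # 75
print(percent_of_quantum(parse_pspart_time("0:01:00"), quantum))   # 100
print(parse_pspart_time("2:12:03"))                                # 7923.0
```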


Applications in Subpartitions 

If there are any applications running in subpartitions of the specified partition, the subpartitions 
appear in the output of pspart as follows: 

% pspart mypart
 PGID   USER    SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   COMMAND
12345   pat      256     5   11:42:20     45.00   75%       0:04:41   mag -sz 256
23456   chris     67     4   Jan 21           -             0:12.30   boggle
34567   smith    192    10   02:21:51   0:01:00  100%       2:12:03   myfft
Active Partitions
OWNER   GROUP   SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   NAME
smith   eng       64     6   09:16:30         -             1:18.10   subpart

The columns for the list of active partitions have the following meanings:

OWNER           The owner of the subpartition.

GROUP           The group of the subpartition.

SIZE            The size of the subpartition (note that all nodes of a subpartition containing
                an active application are considered active, even if not all the nodes in the
                subpartition are actually in use by applications).

PRI             The current priority of the subpartition (this is the highest priority of all the
                applications in the subpartition or the subpartition’s effective priority limit,
                whichever is lower).

START           The time or date when the oldest application in the subpartition was started.

TIME ACTIVE     The amount of time the subpartition has been active (rolled in) in the current
                rollin quantum.

TOTAL TIME      The total amount of time the subpartition has been rolled in since it was
                started.

NAME            The name of the subpartition.
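
The rule for the PRI column of an active subpartition can be written as a one-line calculation. The function name is hypothetical, and the effective priority limit of 6 used below is an assumed value chosen for illustration; the text does not state the limit of subpart.

```python
def subpartition_priority(app_priorities, effective_priority_limit):
    """Current priority of a subpartition: the highest application priority,
    capped by the subpartition's effective priority limit."""
    return min(max(app_priorities), effective_priority_limit)

# One application with priority 7 in a subpartition whose effective
# priority limit is assumed to be 6.
print(subpartition_priority([7], effective_priority_limit=6))      # 6
print(subpartition_priority([2, 4], effective_priority_limit=6))   # 4
```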

See “Scheduling Characteristics” on page 2-33 for more information on how subpartitions are
scheduled.


Recursively Listing Applications in Subpartitions

If there are applications running in a subpartition, the output of pspart normally shows only that the
subpartition is active. To list the applications in subpartitions (and, recursively, in sub-subpartitions
and so on), use the -r switch. For example:


% pspart -r mypart
mypart:
 PGID   USER    SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   COMMAND
12345   pat      256     5   11:42:20     45.00   75%       0:04:41   mag -sz 256
23456   chris     67     4   Jan 21           -             0:12.30   boggle
34567   smith    192    10   02:21:51   0:01:00  100%       2:12:03   myfft
Active Partitions
OWNER   GROUP   SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   NAME
smith   eng       64     6   09:16:30         -             1:18.10   subpart
mypart.subpart:
 PGID   USER    SIZE   PRI   START      TIME ACTIVE      TOTAL TIME   COMMAND
45678   smith     56     7   09:16:30         -             1:18.10   span


In this case, the -r switch shows that the subpartition subpart has one application, span, which is 
running on 56 nodes of the subpartition. (Note that even though the application is not running on 
every node of the subpartition, whenever the application is rolled in the entire subpartition is rolled 
in. This occurs because subpartitions are scheduled recursively, as discussed under “Gang 
Scheduling” on page 2-35.) 


Changing Partition Characteristics 


Command Synopsis Description 

chpart [ -rq time | -sps ] [ -epl priority ]     Change certain partition characteristics.
       [ -nm name ] [ -mod mode ]
       [ -g group ] [ -o owner[.group] ]
       partition


To change the characteristics of a partition, use the chpart command. The permissions required
depend on the switches you use. You can specify the partition with either a relative or an absolute
partition pathname.

chpart can change the following partition characteristics:

• Rollin quantum. 

• Effective priority limit. 

• Partition name. 

• Protection modes. 

• Owner and group. 

• Scheduling type (space-shared to gang-scheduled, or gang-scheduled to space-shared with 
certain limitations; a partition cannot be changed to or from standard scheduling). 

A partition’s size and parent partition are determined when the partition is created and cannot be 
changed. 
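
The scheduling-type rule in the list above can be sketched as a small predicate. The function name and the strings used for the scheduling types are hypothetical; the two flags reflect the limitation given in the description of the -sps switch later in this section.

```python
def can_change_scheduling(current, new,
                          has_overlapping_subpartitions=False,
                          has_applications=False):
    """Return True if chpart could change a partition from current to new.

    Scheduling types are "standard", "gang", or "space".
    """
    changeable = {"gang", "space"}
    if current not in changeable or new not in changeable:
        return False    # nothing changes to or from standard scheduling
    if current == "gang" and new == "space":
        # A gang-scheduled partition must hold no overlapping
        # subpartitions and no applications.
        return not (has_overlapping_subpartitions or has_applications)
    return True

print(can_change_scheduling("space", "gang"))                          # True
print(can_change_scheduling("gang", "space", has_applications=True))   # False
print(can_change_scheduling("standard", "gang"))                       # False
```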

The switches of chpart, which can be used together or separately and in any order (except as noted 
below), are similar to the corresponding switches of mkpart: 

-rq time        Changes the partition to a gang-scheduled partition with a rollin quantum of
                time, where time is one of the following:

                n       n milliseconds (if n is not a multiple of 100, it is
                        rounded up to the next multiple of 100).

                ns      n seconds.

                nm      n minutes.

                nh      n hours.

                0       “Infinite” time: once rolled in, an application runs until
                        it exits.

                The maximum rollin quantum is 24 hours; the minimum rollin quantum for
                your system is determined by your system administrator.

                -rq can be used only on a gang-scheduled or space-shared partition, and
                cannot be used together with -sps. To use -rq, you must have write
                permission on the specified partition.

-sps            Changes the partition to a space-shared partition.

                -sps can be used only on a space-shared or gang-scheduled partition, and
                cannot be used together with -rq. If the partition is currently gang-scheduled,
                it must not contain any overlapping subpartitions or any applications. To use
                -sps, you must have write permission on the specified partition.

-epl priority   Changes the partition’s effective priority limit to priority, where priority is an
                integer from 0 to 10 inclusive.

-epl can be used only on a gang-scheduled or space-shared partition. To use 
-epl, you must have write permission on the specified partition. 

-nm name Changes the partition’s name to name, where name is a valid partition name
(a string of any length containing only uppercase letters, lowercase letters, 
digits, and underscores). To use -nm, you must have write permission on the 
parent partition of the specified partition. 

Note that -nm can only change the partition’s name “in place;” there is no 
way to move a partition to a different parent partition. 

-mod nnn        Changes the partition’s protection modes to the value specified by the
                three-digit octal number nnn. To use -mod, you must be the owner of the
                specified partition or the system administrator.

-mod string Changes the partition’s protection modes to the value specified by the 

nine-character string string. The string must have the form rwxrwxrwx, 
where a letter (r, w, or x) represents a permission granted and a dash (-) 
represents a permission denied. To use -mod, you must be the owner of the 
specified partition or the system administrator. 

-g group Changes the partition’s group to group. The group can be either a group name 

or a numeric group ID. To use -g, you must be the owner of the specified 
partition and a member of the specified new group, or you must be the system 
administrator. 

-o owner[.group] Changes the partition’s owner to owner. If .group is specified, also changes
the partition’s group to group. The owner and group can be either user/group
names or numeric user/group IDs. To use -o, you must be the system
administrator.

For example, the following command changes the rollin quantum of mypart to 20 minutes:

% chpart -rq 20m mypart

The following command changes mypart to a space-shared partition:

% chpart -sps mypart

The following command changes the effective priority limit of mypart to 2:

% chpart -epl 2 mypart 

The following command changes the protection modes of mypart so that it is readable, writable, and 
executable by the owner but not by anyone else: 

% chpart -mod rwx------ mypart

The following command has the same effect as the previous three commands combined, but uses an 
absolute partition pathname and an octal protection mode specifier: 

% chpart -epl 2 -rq 20m -mod 700 .compute.mypart

The following command changes the owner of mypart to smith, but does not affect its group:

% chpart -o smith mypart

The following command changes the group of mypart to support, but does not affect its ownership:

% chpart -g support mypart

The following command changes the owner of mypart to smith and the group to support:

% chpart -o smith.support mypart

The following command changes the name of mypart to newpart:

% chpart -nm newpart mypart

The following command also changes the name of mypart to newpart, but uses an absolute partition
pathname:

% chpart -nm newpart .compute.mypart

Note that the new name is specified as a name only, not a pathname.


Using Paragon™ OSF/1 Message-Passing System Calls


Introduction 

Message passing is the standard means of communication among processes in Paragon OSF/1. As 
independent processor/memory pairs, the nodes do not share physical memory. If the node processes 
need to share information, they can do so by passing messages. The calls described in this chapter 
let your programs send and receive messages. 

This chapter introduces the Paragon OSF/1 message-passing system calls. It includes the following 
sections: 

• Process characteristics. 

• Message characteristics. 

• Names of send and receive calls. 

• Synchronous send and receive. 

• Asynchronous send and receive. 

• Probing for pending messages. 

• Getting information about pending or received messages. 

• Message passing with Fortran commons. 

• Treating a message as an interrupt.

• Extended receive and probe. 

• Global operations.

Within each section, the calls are discussed in order of increasing complexity. That is, the “base”
calls are discussed first, and the “extended” calls are discussed later.

Each section includes numerous examples in both C and Fortran. A call description at the beginning 
of each section or subsection gives a language-independent synopsis (call name, parameter names, 
and brief description) of each call discussed in that section. Differences between C and Fortran are 
noted where applicable. See Appendix A for information on call and parameter types; see the 
Paragon™ C System Calls Reference Manual or the Paragon™ Fortran System Calls Reference
Manual for complete information on each call. 

This chapter does not describe all the Paragon OSF/1 system calls. For information about system 
calls that provide general services other than message passing, see Chapter 4. For information about 
the calls used with the Parallel File System, see Chapter 5. For information about the calls used with 
graphical interfaces, such as DGL and the X Window System, see the Paragon™ Graphics Libraries
User’s Guide. For information about the system calls that require root privileges, see the Paragon™
System Administrator’s Guide. 

Paragon OSF/1 programs written in C can also issue OSF/1 system calls. The Paragon OSF/1 
operating system is a complete OSF/1 system and fully supports all the standard OSF/1 system calls. 
See the OSF/1 Programmer’s Reference for information on these calls. 

Paragon OSF/1 programs written in Fortran cannot make OSF/1 system calls directly, but the 
Fortran runtime library includes a number of system interface routines. These routines make a 
number of OSF/1 system calls available to Fortran programs. See the Paragon™ Fortran Compiler
User’s Guide for information on these routines. 

Process Characteristics


Each process within an application is identified by its node number and process type. A process must 
have a valid node number and process type to send and receive messages. 


Node Numbers 


Synopsis            Description

mynode()            Obtain the calling process’s node number.

numnodes()          Obtain the number of nodes allocated to the
                    current application.


A process’s node number is an integer that identifies the node on which it is running. Node numbers 
are assigned by the system, and range from zero to one less than the number of nodes in the 
application. A process can find out its node number by calling mynode(); the node number does not
change for the life of the process. A process can also find out the number of nodes in the application
by calling numnodes(); the maximum node number in the application is numnodes() - 1.

When you run an application that was linked with the -nx switch, the system creates one process on 
each node of the default partition (unless you specify otherwise on the application’s command line). 
Each process is the same as the others except for its node number, which is different in each process. 

All message-sending system calls have a node parameter that specifies the node to which the 
message is sent. You can use any valid node number, or the special value -1 to send the message to 
all nodes in the application except the sending node itself. 

Some message-receiving system calls have a nodesel parameter that specifies the node from which 
the message was sent. A nodesel parameter can be a valid node number (to receive only messages 
from that node), or the special value -1 (to receive messages from any node). Message-receiving 
system calls that do not have a nodesel parameter always receive messages from any node. 

The node numbers used in message-passing calls are always node numbers within the application, 
not physical slot numbers or node numbers within the partition in which the application is running. 
For example, if you run an application on 30 nodes of a 64-node partition by using the switch -sz 30,
the node numbers within the application will always be 0 through 29. However, those nodes might 
not be nodes 0 through 29 of the partition. They might be nodes 0 through 29, or 10 through 39, or 
a completely arbitrary set of nodes. 
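
The rules for the node parameter of the message-sending calls can be modeled in a few lines of C. The following is a minimal sketch, not part of the Paragon OSF/1 library: send_targets() is a hypothetical helper, and its sender and numnodes arguments stand in for the values that mynode() and numnodes() would return. It lists the application node numbers that a send would reach for a given node argument, including the special value -1.

```c
#include <assert.h>
#include <stddef.h>

#define ALL_OTHER_NODES (-1L)   /* the special node value -1 described above */

/*
 * Hypothetical helper (not declared in nx.h): list the application node
 * numbers that a send with the given node argument would reach.
 *   node      destination passed to the send call (0..numnodes-1, or -1)
 *   sender    node number of the sending process (what mynode() returns)
 *   numnodes  nodes in the application (what numnodes() returns)
 *   out       caller-supplied array with room for at least numnodes entries
 * Returns the number of entries written to out, or -1 if node is not a
 * valid application node number.
 */
long send_targets(long node, long sender, long numnodes, long *out)
{
    long n, count = 0;

    assert(out != NULL);
    assert(numnodes > 0 && sender >= 0 && sender < numnodes);

    if (node == ALL_OTHER_NODES) {
        for (n = 0; n < numnodes; n++)
            if (n != sender)            /* the sending node is skipped */
                out[count++] = n;
        return count;
    }
    if (node < 0 || node >= numnodes)
        return -1;                      /* outside 0..numnodes-1 */
    out[0] = node;
    return 1;
}
```

For a four-node application, a send from node 2 with node set to -1 reaches nodes 0, 1, and 3 in this model. The numbers are always application node numbers, never partition or slot numbers.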

Process Types


Synopsis            Description

myptype()           Obtain the calling process’s process type.

setptype(ptype)     Set the calling process’s process type (only
                    permitted if the process type is currently
                    INVALIDPTYPE).


A process’s process type, or ptype, is an integer that distinguishes the process from other processes
in the same application running on the same node. Process types are assigned by the user, and can
be any integer from 0 to 2,147,483,647 (2^31 - 1) inclusive. A process can find out its process type
by calling myptype(). A process cannot change its process type once it has been set to a valid value.

When you run an application that was linked with -nx, the system sets the process type of all 
processes in the application to the value you specify with the -pt switch on the application’s 
command line (default 0). 

All message-sending system calls have a ptype parameter that specifies the process type to which the 
message is sent. You must specify the process type; you cannot use -1. 

Some message-receiving system calls have a ptypesel parameter that specifies the process type from 
which the message was sent. A ptypesel parameter can be a valid process type (to receive only
messages from that process type), or the special value -1 (to receive messages from any process 
type). Message-receiving system calls that do not have a ptypesel parameter always receive 
messages from any process type. 

Certain system calls that involve all the nodes in the application, called global operations, require 
that every node in the application has one process with the same process type. All these processes 
must call the global operation before the application can proceed. 

Within a single application, multiple processes running on the same node must have different 
process types. However, processes on different nodes may (and usually do) have the same process 
type. Two processes running on a single node may have the same process type only if they belong 
to different applications. 


NOTE 

The -pt switch (or, if not specified, the default process type of 0) 
applies only to the process type of the initial processes created by 
running the application. 

If an application creates additional processes after it starts up, and no process type is specified for
the new process, the new process’s process type is set to the special value INVALIDPTYPE (a
negative constant defined in the header file nx.h). A process whose process type is
INVALIDPTYPE cannot send or receive messages. It must use the system call setptype() to set
its process type to a valid value before it can send or receive any messages. (This is the only valid
use of setptype().)

The Paragon OSF/1 system calls that create node processes (nx_nfork(), nx_load(), and
nx_loadve()) have a ptype parameter that specifies the process type of the newly-created processes.
However, the standard OSF/1 system call fork(), which creates a new process on the same node as
the process that calls it, does not provide any way to specify the new process’s process type. This
means that the process type of a process created by fork() is set to INVALIDPTYPE. The new
process must call setptype() before it can send or receive messages. The specified process type must
be different from the parent’s, and different from the process type of any other process in the same
application on the same node.

A process’s process type is inherited across an exec(). This means that if you do a fork() followed
by an exec(), you can call setptype() either before or after the exec(). However, the setptype() must
follow the fork().

Once a process has used a process type, that process type is associated with the process for the life 
of the application. No other process on the same node in the same application can ever use that 
process type, even if the original process terminates. 
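
The process type rules for one node of one application can be summarized as a small model. This is a sketch under stated assumptions, not the system's implementation: struct node_model, model_setptype(), and MODEL_INVALID_PTYPE are hypothetical names, and the stand-in value -2 is arbitrary (the real INVALIDPTYPE value comes from nx.h). How the real setptype() reports a refused request is not described here, so the model simply returns -1.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_INVALID_PTYPE (-2L)  /* arbitrary stand-in for INVALIDPTYPE */
#define MAX_USED 64                /* model limit, not a system limit */

/* Every process type ever used on this node by this application. */
struct node_model {
    long used[MAX_USED];
    int  nused;
};

static int ptype_was_used(const struct node_model *nm, long ptype)
{
    int i;
    for (i = 0; i < nm->nused; i++)
        if (nm->used[i] == ptype)
            return 1;
    return 0;
}

/*
 * Model of setting a process type. *current is the process's present
 * process type. Returns 0 if the model accepts the request, -1 if not.
 */
int model_setptype(struct node_model *nm, long *current, long ptype)
{
    assert(nm != NULL && current != NULL);

    if (*current != MODEL_INVALID_PTYPE)
        return -1;          /* a valid process type cannot be changed */
    if (ptype < 0 || ptype > 2147483647L)
        return -1;          /* outside 0 .. 2^31 - 1 */
    if (ptype_was_used(nm, ptype))
        return -1;          /* used before on this node, even if that
                               process has since terminated */
    if (nm->nused == MAX_USED)
        return -1;          /* model table is full */
    nm->used[nm->nused++] = ptype;
    *current = ptype;
    return 0;
}
```

In this model a parent that holds process type 0 cannot change it, and a child created by fork() must choose a type other than 0 and other than any type already used on that node. For simplicity the model also records the initial process type (normally set by the -pt switch) through model_setptype().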

If a process has multiple pthreads, all the pthreads in the process have the same process type. See 
Chapter 6 for information on pthreads. 


Message Characteristics 

Messages are characterized by a length, a type, and sometimes an ID. These characteristics are set
when the message is sent, and do not change for the life of the message. 


Message Length 

The length of a message is the number of bytes of information contained in the message. Messages 
in Paragon OSF/1 can be of any length. 

All message-passing system calls have a count parameter that specifies the length of the message to 
be sent or received. The length you specify must be less than or equal to the size in bytes of the buffer 
used in the call. Message-sending calls read exactly that number of bytes from the buffer and send 
them as a message; message-receiving calls generate an error if a message is received that is larger 
than the specified length. 

If you program in C, when you send a message you can use the sizeof operator to determine the size
of your message in bytes. If you program in Fortran, you will need to add up the sizes of all the data
elements within the message; see the Paragon™ Fortran Compiler User’s Guide for information on
the default size of each data type. If you pass named common blocks as messages, you may also have 
to include the space taken up by padding within the common block, as discussed under “Message 
Passing with Fortran Commons” on page 3-17. 

You can also send and receive zero-length messages. This is useful if the message type is sufficient, 
and there is no need to supply any message content. For example, one process could tell another 
process to start or stop doing something by sending a zero-length message of type 1 to start, or a 
zero-length message of type 2 to stop. 


Message Type 

The type of a message is an integer whose meaning is determined by the programmer. 

All message-sending system calls have a type parameter that specifies the type of the message sent. 
You can use any integer from 0 to 999,999,999 (inclusive) as a message type. 

All message-receiving system calls have a typesel parameter that specifies the type (or types) of
messages the call will receive. A typesel parameter can be an integer from 0 to 999,999,999 (to 
receive only messages of the specified type) or the special value -1 (to receive messages of any type). 

There are also special message types outside the range 0 to 999,999,999, called force types and 
typesel masks, that you can use. Sending with a force type sends a message that uses a limited flow 
control technique; receiving with a typesel mask receives messages of a selected set of types. See 
the Paragon™ Fortran System Calls Reference Manual or Paragon™ C System Calls Reference
Manual for information on these special message types. Note, though, that in Paragon OSF/1 regular 
messages are just as fast as force type messages, so force types are not needed for performance. 


Message ID 

The ID of a message is an identifier used to check for the completion of asynchronous messages. 
Synchronous messages do not have IDs. 

When you send or receive a message with an asynchronous message-passing call (one that returns 
before the message is completely sent or received), the call returns an ID that you can use to check 
whether or not the send or receive is complete. See “Asynchronous Send and Receive” on page 3-10 
for more information on message IDs. 


Message Order

Paragon OSF/1 guarantees that all messages will arrive in the same order they are sent. That is, if
one message is sent from node A to node B, then a second message is sent from node A to node B, 
the second message will never arrive before the first. 

Although the first message always arrives at the node first, you can elect to receive the second 
message—that is, to copy its contents into a buffer in user memory—before the first. You do this by 
specifying different message types in the send calls on node A, and specifying the second message’s 
type in the first receive call on node B. 
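
The difference between arrival order and receive order can be illustrated with a small model of the pending messages on one node. This is a sketch only, not how the system is implemented: struct pending_queue, model_arrive(), and model_recv() are hypothetical names, and the model assumes that a receive takes the earliest pending message that matches its type selector. A real crecv() would block where the model returns -1, and the real calls report an error (in a way not described here) where the model returns -2.

```c
#include <assert.h>
#include <string.h>

#define ANY_TYPE    (-1L)   /* the special typesel value -1 */
#define MAX_PENDING 16      /* model limits, not system limits */
#define MAX_BYTES   64

struct pending_msg {
    long type;
    long count;             /* message length in bytes */
    char data[MAX_BYTES];
};

/* Pending messages in arrival order: slot[0] arrived first. */
struct pending_queue {
    struct pending_msg slot[MAX_PENDING];
    int n;
};

/* A message arrives before any receive has been posted for it. */
int model_arrive(struct pending_queue *q, long type,
                 const void *buf, long count)
{
    assert(q != NULL);
    if (q->n == MAX_PENDING || count < 0 || count > MAX_BYTES)
        return -1;
    q->slot[q->n].type = type;
    q->slot[q->n].count = count;
    if (count > 0)
        memcpy(q->slot[q->n].data, buf, (size_t)count);
    q->n++;
    return 0;
}

/*
 * Receive the earliest pending message whose type matches typesel.
 * Returns the message length, -1 if nothing matches, or -2 if the
 * matching message is longer than count.
 */
long model_recv(struct pending_queue *q, long typesel,
                void *buf, long count)
{
    int i, j;

    assert(q != NULL);
    for (i = 0; i < q->n; i++) {
        if (typesel == ANY_TYPE || q->slot[i].type == typesel) {
            long len = q->slot[i].count;
            if (len > count)
                return -2;
            if (len > 0)
                memcpy(buf, q->slot[i].data, (size_t)len);
            for (j = i + 1; j < q->n; j++)   /* close the gap */
                q->slot[j - 1] = q->slot[j];
            q->n--;
            return len;
        }
    }
    return -1;
}
```

If a message of type 1 arrives and then a message of type 2 arrives, a receive with a type selector of 2 returns the second message first, and a later receive with -1 returns the first message.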

Names of Send and Receive Calls 


You can tell what each message-passing call does by examining its name. 

The first character of the name indicates whether the call is synchronous, asynchronous, or handled:

c           Synchronous (complete) call. These calls do not return until the message is
            complete. They are discussed under “Synchronous Send and Receive” on
            page 3-8.

i           Asynchronous (incomplete) call. These calls return immediately, so your
            program can do other work while the message is processed. They are
            discussed under “Asynchronous Send and Receive” on page 3-10.

h           Asynchronous with interrupt handler (handled) call. Like the i...() calls, the
            h...() calls return immediately. Unlike the i...() calls, h...() calls indicate that
            the message is complete by calling a user-supplied interrupt handler. They are
            discussed under “Treating a Message as an Interrupt” on page 3-18.

The initial c, i, or h is followed by a verb that indicates what the call does:

send        Send a message.

recv        Receive a message.

sendrecv    Send a message and receive the reply.

probe       Probe for a pending (not yet received) message.

Finally, the verb may be followed by an x to indicate that it is an “extended” version of the call (see
“Treating a Message as an Interrupt” on page 3-18 and “Extended Receive and Probe” on page 3-24).
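
The naming scheme is regular enough to decode mechanically. The following C sketch is an illustration only: decode_call_name(), struct call_name, and enum call_mode are hypothetical names, not part of nx.h. The function parses a name of the form prefix, verb, optional x. It does not check that a call with that name actually exists.

```c
#include <assert.h>
#include <string.h>

enum call_mode { MODE_UNKNOWN, MODE_SYNC, MODE_ASYNC, MODE_HANDLED };

struct call_name {
    enum call_mode mode;  /* from the first character: c, i, or h */
    const char *verb;     /* "send", "recv", "sendrecv", "probe", or NULL */
    int extended;         /* 1 if the verb is followed by x */
};

/* Decode a call name such as "csend" or "irecvx" (without parentheses). */
struct call_name decode_call_name(const char *name)
{
    /* "sendrecv" is listed before "send" so the longer verb is tried first. */
    static const char *verbs[] = { "sendrecv", "send", "recv", "probe" };
    struct call_name r = { MODE_UNKNOWN, NULL, 0 };
    size_t i, len;

    assert(name != NULL);
    switch (name[0]) {
    case 'c': r.mode = MODE_SYNC;    break;
    case 'i': r.mode = MODE_ASYNC;   break;
    case 'h': r.mode = MODE_HANDLED; break;
    default:  return r;              /* not a c, i, or h call */
    }
    for (i = 0; i < sizeof verbs / sizeof verbs[0]; i++) {
        len = strlen(verbs[i]);
        if (strncmp(name + 1, verbs[i], len) == 0) {
            const char *rest = name + 1 + len;
            if (rest[0] == '\0') {
                r.verb = verbs[i];
                return r;
            }
            if (rest[0] == 'x' && rest[1] == '\0') {
                r.verb = verbs[i];
                r.extended = 1;
                return r;
            }
        }
    }
    r.mode = MODE_UNKNOWN;           /* prefix matched, verb did not */
    return r;
}
```

For example, the sketch decodes "irecvx" as an asynchronous, extended receive, and reports names outside the scheme (such as "msgwait") as unknown.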

The synchronous calls with no additional functionality, such as csend(), are the easiest to understand
and use. However, the asynchronous calls (such as isend()) and the calls with additional
functionality (such as crecvx()) can offer dramatic improvements in performance when properly
used.


Synchronous Send and Receive 

Synopsis                                      Description

csend(type, buf, count, node, ptype)          Send a message, waiting for completion.

crecv(typesel, buf, count)                    Receive a message, waiting for completion.

csendrecv(type, sbuf, scount, node, ptype,    Send a message and post a receive for the reply.
          typesel, rbuf, rcount)              Wait for completion.


The c...() message-passing calls perform synchronous sends and receives.

• A synchronous send means that the program executing the send waits until the send is complete.
This waiting is referred to as blocking. Completing the send, however, does not guarantee that
the message has been received. It only means that the message has left the sending process and
that the buffer can be reused. You use csend() to perform a synchronous send.

• A synchronous receive means that the program executing the receive waits until the message
arrives in the specified buffer. You use crecv() to perform a synchronous receive.

• A csendrecv() is like a csend() followed by a crecv(). It returns the length of the received
message.

Here are two code fragments in C that perform a synchronous send and a synchronous receive. 

• Node 1 sends a message of type 0 to the process with the same process type on node 0: 

#include <nx.h>

#define MSG_TYPE 0
#define DEST_NODE 0
char send_buf[100];

csend(MSG_TYPE, send_buf, sizeof(send_buf), DEST_NODE, myptype());

• Node 0 receives the message:

#include <nx.h>
#define MSG_TYPE 0
char recv_buf[100];

crecv(MSG_TYPE, recv_buf, sizeof(recv_buf));

See “Extended Receive and Probe” on page 3-24 for information on a version of the crecv() call with 
additional functionality. 


Synchronous Send to Multiple Nodes 


Synopsis                                      Description

gsendx(type, buf, count, nodes, nodecount)    Send a message to a list of nodes, waiting for
                                              completion.


The gsendx() call sends a message to multiple nodes. Specifically, it performs a synchronous send
of the message specified by the type, buf, and count arguments to the process with the same process 
type as the caller on the nodes specified by the nodes argument. The nodes argument is an array of 
integers; the nodecount argument specifies the number of nodes in nodes. 

For example, the following code fragment in Fortran sends the data in the array x to nodes 1 and 3: 

      integer*4 nodenums(2), x(10)

      nodenums(1) = 1
      nodenums(2) = 3

      call gsendx(100, x, 10*4, nodenums, 2)

Asynchronous Send and Receive


Synopsis                                      Description

isend(type, buf, count, node, ptype)          Send a message without waiting for completion.

irecv(typesel, buf, count)                    Receive a message without waiting for
                                              completion.

isendrecv(type, sbuf, scount, node, ptype,    Send a message and post a receive for the reply
          typesel, rbuf, rcount)              without waiting for completion.

msgdone(mid)                                  Determine whether a send or receive operation
                                              has completed.

msgwait(mid)                                  Wait for completion of a send or receive
                                              operation.

msgignore(mid)                                Release a message ID as soon as a send or receive
                                              operation completes.


The i...() message-passing calls perform asynchronous sends and receives. The msgdone() and
msgwait() calls are used with the i...() calls to determine when the message has completed; the
msgignore() call is used to discard a message ID as soon as the message has completed.

Unlike a synchronous send or receive, an asynchronous send or receive does not block. It returns a 
unique message ID, which is not reused until released. You can use this ID to check for completion 
at a later time. 


NOTE 

The number of message IDs is limited, so you must release each 
ID after you use it. See “Releasing Message IDs” on page 3-12 for 
information on releasing message IDs. 


You use isend() to perform an asynchronous send, and irecv() to perform an asynchronous receive.
An isendrecv() is like an isend() followed by an irecv(), except that it returns only one message ID
(for the receive). Asynchronous sends can be used together with synchronous receives, and vice
versa. For example, a message sent by isend() could be received by crecv().

You must make sure that an asynchronous operation has completed before you change the contents
of the send buffer or use the contents of the receive buffer. To check if an asynchronous operation
has completed, use the msgdone() call. It returns 1 if an asynchronous call has completed and 0
otherwise. To block until an asynchronous operation has completed, use the msgwait() call. Both
msgdone() and msgwait() take the message ID as an input parameter.

The message ID belonging to an asynchronous receive is distinct from the message ID belonging to
any companion asynchronous send. For example, if node 0 sends a message with isend() and node
1 receives the message with irecv(), the isend() has a different message ID from the irecv(). When
the isend() completes, this does not indicate that the corresponding irecv() has completed.

For example, assume that your application knows that it will need a message later. It posts an
asynchronous receive with irecv(), then does work that does not require the message, expecting
the message to have arrived by the time it is needed. When the program reaches the point where
it needs the message, it calls msgwait(). If the message has already arrived, msgwait() returns
immediately. Otherwise, it blocks until the message arrives. Here is a Fortran code fragment
that implements this technique.

Node 1 does an asynchronous send:

      include 'fnx.h'

      integer result, msg_sid
      integer MSG_TYPE, DEST_NODE
      double precision send_buf(100)
      parameter (MSG_TYPE = 1)
      parameter (DEST_NODE = 0)

      msg_sid = isend(MSG_TYPE, send_buf,
     &                100*8, DEST_NODE, myptype())

c     Free the asynchronous send ID
      call msgwait(msg_sid)

Node 0 does the asynchronous receive:

      include 'fnx.h'

      integer result, msg_rid
      integer MSG_TYPE
      double precision rec_buffer(100)
      parameter (MSG_TYPE = 1)

c     Post the receive
      msg_rid = irecv(MSG_TYPE, rec_buffer, 100*8)

c     Now you need the message.
c     Free the asynchronous receive ID
      call msgwait(msg_rid)

When the msgwait() returns, the message has been received. You may have blocked on the
msgwait() if the message had not yet arrived. You may now assign another value to msg_rid.

See “Extended Receive and Probe” on page 3-24 for information on a version of the irecv() call with 
additional functionality. 


Releasing Message IDs 

Because Paragon OSF/1 has a limited number of message IDs, you must release IDs that are no 
longer needed. There are several ways to release a message ID:

• You can call msgwait(). 

• You can keep calling msgdone() until it returns 1.

• You can call msgignore(). 

If you use msgignore(), it tells the system to release the message ID as soon as the corresponding
send or receive has completed. Note, though, that this leaves you with no way to determine whether 
or not the message has completed. In this case, your application must have some other means of 
synchronization to prevent the send or receive buffer from being used before the message is 
complete. 

NOTE

Re-using a send or receive buffer before the message is complete
can result in unexpected behavior. Do not use msgignore()
unless you are certain this will not occur.


Merging Message IDs 


Synopsis                   Description

msgmerge(mid1, mid2)       Merge two message IDs into a single ID that can
                           be used to wait for completion of both operations.


The msgmerge() call gives you a way to merge two or more message IDs together. It takes two
message IDs as parameters, and returns a message ID that does not complete until both the messages
identified by the input message IDs have completed.

Once you have merged a message ID with msgmerge(), you should not use the input message IDs
as arguments to msgwait(), msgdone(), msgcancel(), or msgignore(). The input message IDs are
automatically released when the merged message ID is waited for.

For example, the following C code fragment posts two irecv()s, one for a message of type 1 and the
other for a message of type 2, and then waits until both have completed: 

#include <nx.h>

int mid1, mid2, midg;
char buf1[10], buf2[10];

mid1 = irecv(1, buf1, 10);
mid2 = irecv(2, buf2, 10);
midg = msgmerge(mid1, mid2);
msgwait(midg);

Note that mid1 and mid2 are released by the msgwait() call on midg.

You can use a series of msgmerge() calls to merge multiple message IDs together. To help you do
this, you can use the value -1 as one of the message IDs; msgmerge() returns the other message ID
unchanged. 

For example, the following Fortran code fragment uses a series of isend() calls to send the buffer buf
as a message of type 1 to the process with the same process type on nodes 1 through 10, then waits
for all of the isend()s to complete:

      include 'fnx.h'

      integer i, mid
      integer buf(100)

      mid = -1
      i = 1

      do while (i .le. 10)
         mid = msgmerge(mid, isend(1, buf, 400, i, myptype()))
         i = i + 1
      end do

      call msgwait(mid)

The message ID returned by each isend() call is merged together with the message IDs of the
previous isend() calls into the merged message ID mid (the first message ID is merged with -1,
yielding itself). Once all the isend()s have been posted, the program uses msgwait() on the merged
message ID to wait for all of the isend()s to complete.
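
The way -1 serves as a starting value in a chain of msgmerge() calls can be shown with a small model. This is a sketch of the behavior described above, not the system's implementation: the model_...() functions and the ids table are hypothetical, message IDs are modeled as table indexes, and the model never releases an ID.

```c
#include <assert.h>

#define MAX_IDS 32          /* model limit, not the system limit */
#define NO_ID   (-1L)       /* the value -1 accepted by msgmerge() */

struct model_id {
    int  done;              /* plain ID: has the operation completed? */
    long left, right;       /* merged ID: its two input IDs, else NO_ID */
};

static struct model_id ids[MAX_IDS];
static long next_id = 0;

/* Create a plain ID for an operation that has not completed yet. */
long model_post(void)
{
    assert(next_id < MAX_IDS);
    ids[next_id].done = 0;
    ids[next_id].left = NO_ID;
    ids[next_id].right = NO_ID;
    return next_id++;
}

/* Mark the operation behind a plain ID as complete. */
void model_complete(long id)
{
    assert(id >= 0 && id < next_id);
    ids[id].done = 1;
}

/* Merging with -1 returns the other ID unchanged. */
long model_msgmerge(long a, long b)
{
    long id;

    if (a == NO_ID)
        return b;
    if (b == NO_ID)
        return a;
    id = model_post();
    ids[id].left = a;
    ids[id].right = b;
    return id;
}

/* A merged ID is complete only when both of its inputs are complete. */
int model_msgdone(long id)
{
    assert(id >= 0 && id < next_id);
    if (ids[id].left == NO_ID)
        return ids[id].done;
    return model_msgdone(ids[id].left) && model_msgdone(ids[id].right);
}
```

Starting from -1 and merging three IDs in turn yields one ID that the model reports as complete only after all three operations have completed, which mirrors the loop in the Fortran fragment.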


Probing for Pending Messages 


Synopsis            Description

cprobe(typesel)     Wait for a message of a selected type to arrive.

iprobe(typesel)     Determine whether a message of a selected type is
                    pending.


When a message arrives for which no receive has been issued, it goes into a system buffer. It is 
referred to as a pending message: a message that is available for receipt, but not yet received. When 
you issue a receive for that message, the message is moved into the application’s buffer (the buffer 
you specify in the crecv() or irecv() call). If a receive has already been issued when the message
arrives, it goes directly into the application’s buffer and bypasses the system buffer.

The cprobe() and iprobe() calls determine whether there is a message of a given type pending in the
system buffer. You can use a message type from 0 to 999,999,999 to probe for a message of a
specific type; the special value -1 to probe for a message of any type; or a typesel mask to probe for
messages of a selected set of types (see the Paragon™ Fortran System Calls Reference Manual or
Paragon™ C System Calls Reference Manual for information on typesel masks).

The cprobe() call is a blocking call. It takes a type selection parameter as input and returns when a
message of the given type has arrived. The iprobe() call is similar to cprobe(), except that it is
nonblocking. iprobe() returns 1 if the message is pending and 0 if it is not.

cprobe() and iprobe() are not the only calls that probe for messages. See “Extended Receive and
Probe” on page 3-24 for information on message-probing calls with additional functionality. 

Getting Information About Pending or Received Messages


Synopsis            Description

infocount()         Return size in bytes of a pending or received
                    message.

infonode()          Return node number of the node that sent a
                    pending or received message.

infoptype()         Return process type of the process that sent a
                    pending or received message.

infotype()          Return message type of a pending or received
                    message.


The info...() calls return information about received or pending messages. You can obtain the size
of the message, its type, and the node number and process type of the sending process.

The return value of the info...() calls is defined only in the following cases:

• After a crecv(), cprobe(), or msgwait().

• After an iprobe() or msgdone() returns 1.

Note that you must issue the info...() call before you perform any other message-passing operation.
Otherwise, you will get information about the most recently received or pending message instead.

For example, the following C code receives a message of any type, then uses infotype() to determine
what type of message was actually received:

#include <nx.h>

#define BIGNUM 262144
long buf[BIGNUM], msg_type;

crecv(-1, buf, sizeof(buf));
msg_type = infotype();

Another example: the following C code blocks until any message arrives, then allocates a buffer just 
large enough to hold the message and receives it: 


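#include <stdlib.h>
#include <nx.h>

char *buf;
long msg_type, msg_len;

cprobe(-1);                          /* block until any message is pending */
msg_type = infotype();               /* type of the pending message */
msg_len = infocount();               /* its length in bytes */
buf = (char *) calloc(msg_len, 1);   /* buffer just large enough */
crecv(msg_type, buf, msg_len);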


Between the cprobe() and the crecv(), the message is pending; it has arrived, but has not yet been 
received. Until the message is received, the contents of the message are not accessible to the 
program. 

The info...() calls are subject to the following special conditions: 

• The return value of the info...() calls is undefined after a msgwait() or msgdone() if the message
ID in the msgwait() or msgdone() call is a “merged” message ID representing more than one
message. See “Merging Message IDs” on page 3-13 for more information.

• The return value of the info...() calls is undefined after a crecvx(), cprobex(), or iprobex(),
except if the last parameter is the special array msginfo. See “Extended Receive and Probe” on
page 3-24 for more information. 

• If you issue an info...() call before doing any message passing, the call returns -1. 

The info...() calls are not the only way to get information about a received or pending message. See
“Extended Receive and Probe” on page 3-24 for information on message-receiving and 
message-probing calls that also return information about the received or pending message. 

Message Passing with Fortran Commons 

Fortran users often use common blocks to send messages that contain data elements of different 
types. For example, consider the named common containing a double precision number and an 
integer. It is good Fortran practice to put the largest data element first in the common list, as follows: 

      integer i
      double precision d
      common /msg/ d, i

To send this common block, specify the name of the first common element as the buffer and the 
length of the entire common as the length. For example, to send the common block named msg, send 
the variable d with a length of 12 bytes (8 for the double precision variable plus 4 for the integer 
variable). The following csend() call sends msg to process ptype on node node.

      call csend(MSGTYPE, d, 12, node, ptype)

If you put smaller data elements before larger data elements in common blocks, the compiler may 
have to insert padding, or “holes,” between the elements of the common block to preserve data 
alignment. For example, if you define the common block named pmsg as follows, the compiler will
place an invisible 4-byte pad between the end of i and the beginning of d to properly align d on an 
8-byte boundary: 

      integer i
      double precision d
      common /pmsg/ i, d

This padding has two effects: 

• If you send this common block as a message, you must include the padding in the length of the
message. For example, even though pmsg contains the same two variables as msg, pmsg is 4
bytes longer than msg because of the padding between i and d. To send pmsg to process ptype
on node node, you would use the following call:

      call csend(MSGTYPE, i, 16, node, ptype)

• If another routine uses a different view of the same common block, you may have to add
additional variables to the other routine’s declaration of the common block to take this padding
into account. For example, if another routine wants to view d in pmsg as an array of two integers,
it must declare pmsg as follows:

      integer i, ipad, id(2)
      common /pmsg/ i, ipad, id

The variable ipad corresponds to the 4-byte pad in the original routine’s declaration of pmsg. 
Without this variable, the position of id would not correspond to the position of d in the original 
common block. This variable is necessary if pmsg is shared between these two routines, whether 
or not the two routines run on different nodes. 

When possible, you should define common blocks with the largest data element first, to avoid 
padding completely. You should also use the %LOC function to determine the size of a common 
block and avoid specifying its size with a hard-coded constant. 
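
The padding arithmetic above can be checked with a short calculation. The following C sketch is an illustration only: align_up() and common_length() are hypothetical helpers, and they assume that each element is aligned on a multiple of its own size and that no padding follows the last element. That assumption matches the msg and pmsg examples above, but it may not hold for every compiler or data type.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be > 0). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) / align * align;
}

/*
 * Length in bytes of a block whose n elements have the given sizes, in
 * declaration order. Assumes each element is aligned on a multiple of
 * its own size and that nothing follows the last element.
 */
size_t common_length(const size_t *sizes, size_t n)
{
    size_t i, offset = 0;

    for (i = 0; i < n; i++) {
        assert(sizes[i] > 0);
        offset = align_up(offset, sizes[i]);  /* insert padding if needed */
        offset += sizes[i];                   /* then place the element */
    }
    return offset;
}
```

With element sizes of 8 and 4 bytes (d, then i) the sketch gives 12 bytes; with 4 and 8 bytes (i, then d) it gives 16 bytes, because 4 bytes of padding are inserted before the 8-byte element.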

Treating a Message as an Interrupt 


Synopsis                                          Description

hsend(type, buf, count, node, ptype, handler)     Send a message and set up a handler procedure to
                                                  be called when the send completes.

hrecv(typesel, buf, count, handler)               Receive a message and set up a handler procedure
                                                  to be called when the receive completes.

hsendrecv(type, sbuf, scount, node, ptype,        Send a message and post a receive for the reply.
          typesel, rbuf, rcount, handler)         Set up a handler procedure to be called when the
                                                  reply arrives.


The h...() message-passing calls perform asynchronous sends and receives. However, unlike the i...()
calls, the h...() calls let you establish a user-provided interrupt handler, which is called when the send
or receive is complete. 

The h...() receive calls let you treat incoming messages as interrupts. For example, consider a 
program that performs some action based on the type of a received message. One way to implement 
this program is to block the program at a crecv() for messages of all types and then take appropriate 
action based on the value returned by infotype().

Another way is to issue a number of hrecv() calls. Each call attaches a function to a particular 
message type or set of types. Your program does not block. You can continue with other work; but 
when the appropriate message comes, the attached function is called to take care of the message.
(The message is stored in the receive buffer before the function is called.) 
The handler function you define must be written in C and must have four arguments of type long.
These arguments are passed the following values when the function is called:

1. Type of the message (as returned by infotype()).

2. Length of the message in bytes (as returned by infocount()).

3. Node number of the process that sent the message (as returned by infonode()).

4. Process type of the process that sent the message (as returned by infoptype()).

For example, here’s a C code fragment that attaches the functions funct0(), funct1(), and funct2() to
message types 0, 1, and 2, respectively. The message types that have handlers are referred to as
handled types.

#include <nx.h>

char buf0[100], buf1[100], buf2[100];
void funct0(), funct1(), funct2();

hrecv(0, buf0, sizeof(buf0), funct0);
hrecv(1, buf1, sizeof(buf1), funct1);
hrecv(2, buf2, sizeof(buf2), funct2);

/* Now perform other work. No blocking happens. */


The declaration of funct1() looks like this (the other functions are similar):

void funct1(long type, long count, long node, long ptype)
{

}

When a message of type 1 arrives, the message is stored in the buffer specified in the hrecv() call
(in this case, buf1), then funct1() is called with the type and length of the message and the node
number and process type of the sender as arguments. funct1() and the main program then run
concurrently until funct1() returns. (In previous releases of Paragon OSF/1, the main program was
interrupted and did not run at all until funct1() returned.)


CAUTION 

The handler runs in the same memory space as the main program 
(but they have separate stacks). 
Because of this, parts of the main program may have to be protected from being executed at the same
time as the handler; see “Preventing Interrupts” on page 3-22 for information on using masktrap()
to do this.


NOTE 

Once you have established a handler for a message type, do not
attempt to receive a message of that type with a crecv...() or
irecv...() call.


hsend() operates the same as hrecv(), except that the handler is invoked when the send completes.
(Note that completion of the send does not mean that the message has been received, only that the
message has been sent and the send buffer can be reused.) hsendrecv() is like an isend() followed
by an hrecv(), with the message ID of the isend() automatically released by msgignore().

See “Extended Receive and Probe” on page 3-24 for information on a version of the hrecv() call
with additional functionality.


Passing Information to the Handler 


Synopsis                                  Description

hsendx(type, buf, count, node, ptype,     Send a message and set up an extended handler
xhandler, hparam)                         procedure to be called with the value hparam
                                          when the send completes. Allows handler sharing.


hsendx() is identical to hsend() except that it has an additional parameter, hparam, which is passed
to the handler when it is called. The declaration of a handler for hsendx() looks like this:

void xhandler(long type, long count, long node, long ptype,
              long hparam)
{

}


3-20 



Paragon™ User's Guide 


Using Paragon™ OSF/1 Message-Passing System Calls


You can use the hparam parameter to write handlers that are shared among several hsendx() calls,
each of which uses a different value of hparam to identify itself. For example, here is a C program
fragment that sends two messages of type 0 to the process with process type 2 on node 1, then uses
an hsendx() handler to free each message buffer as soon as the message send completes:

#include <stdio.h>
#include <nx.h>
#include <malloc.h>

#define NBUFS 2
#define BUFFER_SIZE 10000

char *buf[NBUFS]; /* array of pointers to char */

/* shared handler: hparam identifies which buffer to free */
void freemem(long type, long count, long node, long ptype,
             long hparam)
{
    if( (hparam >= 0) && (hparam < NBUFS) ) {
        free(buf[hparam]);
    } else {
        printf("freemem(): invalid value: %ld\n", hparam);
    }
}

main(int argc, char **argv)
{
    /* allocate two buffers with malloc() */
    buf[0] = malloc(BUFFER_SIZE);
    buf[1] = malloc(BUFFER_SIZE);

    /* put data into the buffers */

    /* send them */
    hsendx(0, buf[0], BUFFER_SIZE, 1, myptype(), freemem, 0);
    hsendx(0, buf[1], BUFFER_SIZE, 1, myptype(), freemem, 1);

    /* Now perform other work */
}

Note that you must take care that this handler is not called while the program is in the middle of a
call to malloc() or free(). If the handler attempts to free memory while another part of the program
is allocating or freeing memory, malloc()’s internal memory structures could become corrupted.
You can prevent this by using the masktrap() call, described in the following section, to protect each
malloc() and free() call elsewhere in the program that could be interrupted by this handler.




Preventing Interrupts 


Synopsis                                  Description

masktrap(state)                           Enable or disable interrupts for message handlers.
                                          Required to prevent corruption of global
                                          variables.

If you have one or more handlers set up and you have some critical code that you do not want
interrupted, use the masktrap() call. A masktrap(1) prevents any handler from running; a
masktrap(0) re-enables handlers. Any pending interrupts are honored when the mask is removed.
masktrap() returns the previous masking state (1 or 0). For example:

hrecv(6, buf, sizeof(buf), myhandler);

/* this code can be interrupted */
/* by a message of type 6 */

oldmask = masktrap(1);

/* critical code that must not be interrupted */

masktrap(oldmask);

/* this code can be interrupted again */

Note the use of the variable oldmask to save the value of the previous masking state before the call
to masktrap(1). This means that if the mask were already set before this call (for example, if this
code is in a subroutine that could be called when the mask is already set), the following
masktrap(oldmask) would not unset it.
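
The effect of saving and restoring the previous state can be traced with a small simulation. This is a sketch in standard C with a hypothetical stand-in, sim_masktrap(), that only records the mask state; it does not block any real handler, and the other names are illustrative as well.

```c
/* Simulated mask state: 1 = handlers masked, 0 = handlers enabled. */
static int handlers_masked = 0;

/* Stand-in for masktrap(): set the new state, return the previous one. */
int sim_masktrap(int state)
{
    int previous = handlers_masked;
    handlers_masked = state;
    return previous;
}

int handlers_are_masked(void)
{
    return handlers_masked;
}

static long shared_total = 0;

long get_total(void)
{
    return shared_total;
}

/* Inner critical section: restores whatever state it found on entry. */
void add_to_total(long amount)
{
    int oldmask = sim_masktrap(1);
    shared_total += amount;
    sim_masktrap(oldmask);   /* not sim_masktrap(0): the caller may hold the mask */
}

/* Outer critical section that calls the inner one while masked.
 * Returns the mask state observed after the inner call returned. */
int add_twice(long amount)
{
    int oldmask = sim_masktrap(1);
    int still_masked;

    add_to_total(amount);
    still_masked = handlers_are_masked();  /* 1: the inner call did not unmask */
    shared_total += amount;
    sim_masktrap(oldmask);
    return still_masked;
}
```

If add_to_total() ended with sim_masktrap(0) instead, the second update in add_twice() would run with handlers enabled, which is the problem the oldmask idiom avoids.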


CAUTION 

You must use masktrap() around any code in the main program
that could interfere with calls in the handler.


For example, if the handler performs any I/O, you must put masktrap() calls around any I/O calls
(such as printf()) in the main program that could be called while the handler is active. If you don’t
do this, you could find characters from the handler’s output interleaved with characters from the
main program’s output.

Sometimes, though, it’s not as obvious which calls could interfere with each other. For example, any
two library calls that could allocate or free memory could cause the memory subsystem to become
confused if they were called at the same time. To be on the safe side, keep the handler as simple as
possible and use masktrap() to protect all library calls in the rest of the program that could call the
same subsystems as the calls in the handler while the handler is active.

These calls to masktrap() are necessary because, when the handler is active, the handler and the
main program share the same memory space and can change each other’s global variables. This
could cause any non-reentrant function to fail if it is called by both at the same time.

If the handler performs any message passing, any info...() call in the main program must be within
the same set of masktrap() calls as the message-receiving call to which it applies. Otherwise, the
info...() call could reflect the value of a message received within the handler.

Extended Receive and Probe

Synopsis                                  Description

crecvx(typesel, buf, count, nodesel,      Receive a message of a specified type from a
ptypesel, info)                           specified sending node and process type, together
                                          with information about the message. Wait for
                                          completion.

irecvx(typesel, buf, count, nodesel,      Receive a message of a specified type from a
ptypesel, info)                           specified sending node and process type, together
                                          with information about the message. Do not wait
                                          for completion.

hrecvx(typesel, buf, count, nodesel,      Receive a message of a specified type from a
ptypesel, xhandler, hparam)               specified sending node and process type. Set up an
                                          extended handler procedure to be called with
                                          information about the message and the value
                                          hparam when the receive completes.

cprobex(typesel, nodesel, ptypesel,       Wait for a message of a specified type from a
info)                                     specified sending node and process type. Return
                                          information about the message.

iprobex(typesel, nodesel, ptypesel,       Determine whether a message of a specified type
info)                                     from a specified sending node and process type is
                                          pending. If it is, return information about the
                                          message.

The extended receive and probe calls, crecvx(), irecvx(), hrecvx(), cprobex(), and iprobex(), can
be used to receive or probe for a message from a particular node or a particular process type, and
return information about the message along with the message (instead of using the info...() calls).

crecvx(), irecvx(), cprobex(), and iprobex() are like crecv(), irecv(), cprobe(), and iprobe(),
except that they have the following additional parameters:

nodesel     Specifies the node that sent the message, or -1 for any node.

ptypesel    Specifies the process type that sent the message, or -1 for any process type.

info        An array of eight long integers that receives information about the specified
            message. The following information is stored into the first four elements of
            this array:

            • Type of the message (as returned by infotype()).

            • Length of the message in bytes (as returned by infocount()).

            • Node number of the process that sent the message (as returned by
              infonode()).

            • Process type of the process that sent the message (as returned by
              infoptype()).

            The remaining four elements of the array are reserved.
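
The layout of the info array described above can be given names in a program. The following is a minimal sketch in standard C: the enum constants, struct msg_summary, and summarize() are hypothetical helpers for illustration and are not defined by nx.h.

```c
/* Indices into the eight-element info array, in the order listed above. */
enum {
    INFO_TYPE  = 0,   /* type of the message */
    INFO_COUNT = 1,   /* length of the message in bytes */
    INFO_NODE  = 2,   /* node number of the sender */
    INFO_PTYPE = 3,   /* process type of the sender */
    INFO_LEN   = 8    /* total length; elements 4 through 7 are reserved */
};

struct msg_summary {
    long type;
    long count;
    long node;
    long ptype;
};

/* Copy the four documented elements out of an info array. */
struct msg_summary summarize(const long info[INFO_LEN])
{
    struct msg_summary s;

    s.type  = info[INFO_TYPE];
    s.count = info[INFO_COUNT];
    s.node  = info[INFO_NODE];
    s.ptype = info[INFO_PTYPE];
    return s;
}
```

A program would declare long info[INFO_LEN], pass it as the last argument of an extended receive or probe call, and then read the fields by name rather than by bare index.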

hrecvx() is like hrecv(), except that it has the same nodesel and ptypesel parameters as the other
...x() calls and the same hparam parameter as the hsendx() call. hrecvx() does not have an info
parameter, because the corresponding information is passed to the handler when it is called.

The info parameter of crecvx(), irecvx(), cprobex(), and iprobex() must be specified and must not
be zero or null. If you do not want this information, or you want it to be available to the info...() calls,
specify the special array msginfo, defined in nx.h or fnx.h. The array msginfo is used by the non-x
versions of these calls, and the info...() calls get their information from msginfo. This is why you
cannot use the info...() calls after crecvx(), cprobex(), or iprobex() unless you specify msginfo as
the last parameter of the extended receive or probe call.

The info parameter of irecvx() does not contain valid data until the message is received (as
determined by msgdone() or msgwait()). The info parameter of iprobex() does not contain valid
data unless the iprobex() returns 1.

Global Operations

Synopsis                                  Description

gcol(x, xlen, y, ylen, ncnt)              Concatenation.
gcolx(x, xlens, y)                        Concatenation for contributions of known length.
gdhigh(x, n, work)                        Vector double precision MAX.
gdlow(x, n, work)                         Vector double precision MIN.
gdprod(x, n, work)                        Vector double precision MULTIPLY.
gdsum(x, n, work)                         Vector double precision SUM.
giand(x, n, work)                         Vector integer bitwise AND.
gihigh(x, n, work)                        Vector integer MAX.
gilow(x, n, work)                         Vector integer MIN.
gior(x, n, work)                          Vector integer bitwise OR.
giprod(x, n, work)                        Vector integer MULTIPLY.
gisum(x, n, work)                         Vector integer SUM.
gland(x, n, work)                         Vector logical AND.
glor(x, n, work)                          Vector logical inclusive OR.
gopf(x, xlen, work, function)             Arbitrary commutative function.
gshigh(x, n, work)                        Vector real MAX.
gslow(x, n, work)                         Vector real MIN.
gsprod(x, n, work)                        Vector real MULTIPLY.
gssum(x, n, work)                         Vector real SUM.
gsync()                                   Global synchronization.


The g...() calls perform operations that use data from every node in the application. In general, when
you make one of these calls, each node contributes a piece of data to the operation, the operation is 
performed on the whole collection of data, and then the result of the operation is returned to each 
node. 
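
The contribute, combine, and return pattern can be illustrated for a vector sum. The following is a single-process sketch in standard C, not the NX implementation: sim_gdsum() is a hypothetical function that holds every node's vector in one array, whereas the real gdsum() is called once on each node with that node's own x, n, and work. The sketch also assumes that the result replaces each node's input vector.

```c
#include <stddef.h>

/*
 * x holds nnodes vectors of n doubles each, stored one node after another:
 * x[node * n + i] is element i of the vector contributed by that node.
 * work is scratch space of n doubles.
 * On return, every node's vector holds the element-wise sum over all nodes.
 */
void sim_gdsum(double *x, size_t nnodes, size_t n, double *work)
{
    size_t node, i;

    /* Combine: accumulate every node's contribution into work. */
    for (i = 0; i < n; i++)
        work[i] = 0.0;
    for (node = 0; node < nnodes; node++)
        for (i = 0; i < n; i++)
            work[i] += x[node * n + i];

    /* Return: hand the same result back to every node. */
    for (node = 0; node < nnodes; node++)
        for (i = 0; i < n; i++)
            x[node * n + i] = work[i];
}
```

With three nodes contributing {1, 2}, {10, 20}, and {100, 200}, every node ends up holding {111, 222}.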

These operations are synchronizing calls: if any node in an application makes one of these calls, it 
blocks until every node in the application has made the same call. (In the simplest case, gsync(), this 
synchronization is the only operation performed by the call.) One process on each node in the 
application must make the call, and all the processes that make the call must have the same process 
type. 

The global operations are implemented using dynamic algorithm selection for maximum 
performance. The system considers several ways of exchanging the needed information between the 
nodes, and selects the one that minimizes the time required to perform the global operation given the 
size and shape of the application. 

ioctl() fills in the elements of this structure with information about the device. The value of the
mt_type field is always 0x0C (indicating a generic SCSI device). The values of the mt_dsreg and
mt_erreg fields are device-dependent.

For example, the following C program prints the status of the device connected to /dev/io0/rmt6:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>

main() {
    int fd;
    struct mtget s;

    /* open the tape device read-only */
    fd = open("/dev/io0/rmt6", O_RDONLY, 0666);
    if (fd == -1) {
        perror("opening /dev/io0/rmt6");
        exit(1);
    }

    /* ask the driver for the device status */
    if (ioctl(fd, MTIOCGET, &s) == -1) {
        perror("getting status of tape");
        exit(2);
    }

    printf("mt_type = 0x%x\n", s.mt_type);
    printf("mt_dsreg = 0x%x\n", s.mt_dsreg);
    printf("mt_erreg = 0x%x\n", s.mt_erreg);
    printf("mt_resid = 0x%x\n", s.mt_resid);
}

Introduction

Paragon OSF/1 system calls are available to all programs running on the Intel supercomputer. These 
system calls provide a variety of specialized functions that let processes running on different nodes 
share data and coordinate their activities. 

This chapter introduces the Paragon OSF/1 system calls that perform general services other than 
message passing. It includes the following sections, each of which describes a group of related calls: 

• Managing applications. 

• Managing partitions. 

• Listing unusable nodes. 

• Handling errors. 

• Controlling floating-point behavior. 

• Miscellaneous calls. 

• iPSC® system and Touchstone DELTA system compatibility calls. 

Within each section, the calls are discussed in order of increasing complexity. That is, the “base” 
calls are discussed first, and the “extended” calls are discussed later. 

Each section includes numerous examples in both C and Fortran. A call description at the beginning 
of each section or subsection gives a language-independent synopsis (call name, parameter names, 
and brief description) of each call discussed in that section. Differences between C and Fortran are 
noted where applicable. See Appendix A for information on call and parameter types; see the
Paragon™ C System Calls Reference Manual or the Paragon™ Fortran System Calls Reference
Manual for complete information on each call.

This chapter does not describe all the Paragon OSF/1 system calls. For information about system
calls that perform message passing, see Chapter 3. For information about the calls used with the
Parallel File System, see Chapter 5. For information about the calls used with graphical interfaces,
such as DGL and the X Window System, see the Paragon™ Graphics Libraries User’s Guide. For
information about the system calls that require root privileges, see the Paragon™ System
Administrator’s Guide.

Paragon OSF/1 programs written in C can also issue OSF/1 system calls. The Paragon OSF/1
operating system is a complete OSF/1 system and fully supports all the standard OSF/1 system calls.
See the OSF/1 Programmer’s Reference for information on these calls.

Paragon OSF/1 programs written in Fortran cannot make OSF/1 system calls directly, but the
Fortran runtime library includes a number of system interface routines. These routines make a
number of OSF/1 system calls available to Fortran programs. See the Paragon™ Fortran Compiler
User’s Guide for information on these routines.


Managing Applications 

Paragon OSF/1 provides system calls that let you create parallel applications, control their 
execution, and get information about them. See “Running Applications” on page 2-11 and 
“Managing Running Applications” on page 2-23 for introductory information on applications. 

Controlling Application Execution with System Calls

Synopsis                                  Description

nx_initve(partition, size, account,       Create a new application.
argc, argv)

nx_initve_rect(partition, anchor, rows,   Create a new application with a rectangular shape.
cols, account, argc, argv)

nx_pri(pgroup, priority)                  Set the priority of an application.

nx_nfork(node_list, numnodes, ptype,      Copy the current process onto some or all nodes of
pid_list)                                 an application.

nx_load(node_list, numnodes, ptype,       Execute a stored program on some or all nodes of
pid_list, pathname)                       an application.

nx_loadve(node_list, numnodes, ptype,     Execute a stored program on some or all nodes of
pid_list, pathname, argv, envp)           an application, with specified argument list and
                                          environment.

nx_waitall()                              Wait for all application processes.


The simplest way to control the way an application executes is to use the command-line switch -nx 
when you link the application. (See “Compiling and Linking Applications” on page 2-5 for more 
information on the -nx switch.) When you execute a program that was linked with -nx, the program 
is automatically copied onto the specified number of nodes in the specified partition, runs, and then 
when all the nodes have finished you get your prompt back. 

The code linked in by -nx reads the command line and environment variables, then performs the 
following actions for you (see “Controlling the Application’s Execution Characteristics” on page 
2-13 for more information): 

• Creates a new, empty application in the partition specified by the -pn switch and on the nodes
of that partition specified by the -sz or -nd switch. If -pn is not used, the partition is specified
by $NX_DFLT_PART, or .compute if $NX_DFLT_PART is not set. If neither -sz nor -nd is used,
the number of nodes is specified by $NX_DFLT_SIZE, or all nodes of the partition if
$NX_DFLT_SIZE is not set.

• Sets the application’s priority to the value specified by -pri (default 5).

• Loads and starts your program(s) on the nodes specified by -on (default all nodes of the 
application) with the process type specified by -pt (default 0). 

The nx_...() system calls perform the same actions as those of the code linked in by -nx, but under 
program control instead of command-line control. Using these calls is more complicated than using 
-nx, but gives your program more flexibility and control. 

NOTE

If you use nx_initve() or nx_initve_rect(), do not link the program
with the -nx switch. Use the switch -lnx instead.


The switch -lnx links in the library libnx.a, which contains all the calls discussed in this manual, but
does not link in the automatic start-up code linked in by -nx.


Creating an Application with nx_initve() 

nx_initve() creates a new, empty application. The process that calls nx_initve() becomes the new
application’s controlling process; see “The Controlling Process” on page 4-21 for information on
what this means.

The partition and size of the new application can be specified by parameters or by the command line;
the priority and mp_switches are specified by the command line. If command-line switches are not
used or the command line is ignored by specifying zero for argc, the application’s execution
characteristics default as discussed under “Controlling the Application’s Execution Characteristics”
on page 2-13 and the mp_switches default as discussed under “Message-Passing Configuration
Switches” on page 8-18.

nx_initve() just allocates the specified number of nodes from the partition; it does not start any
processes. (This allocation may or may not be exclusive, depending on the characteristics of the
partition.) You must call nx_nfork(), nx_load(), or nx_loadve() to start processes in the new
application. The nodes allocated to the application are automatically deallocated when all the
processes in the application have terminated.

Another effect of nx_initve() is to make sure that the calling process is a process group leader. If
the calling process is not already a process group leader, nx_initve() creates a new process group,
removes the calling process from its current process group, and makes the calling process the new
process group’s leader. If you’re not familiar with these terms, see “Process Groups” on page 4-22.

Finally, nx_initve() also initializes the data structures required by all the other calls described in this
manual. In an application linked with -nx, the code linked in by -nx performs this initialization
before the application starts up, so you can use these other calls anywhere in the application. In an
application linked with -lnx, however, you must call nx_initve() before you can use any of the other
calls described in this manual. If called before nx_initve(), these other calls will fail; the way they
fail depends on the call (as described under “Handling Errors” on page 4-42). For example, if you
call csend() before calling nx_initve(), the csend() prints an error message and terminates the
calling process.

The parameters of nx_initve() have the following meanings:

partition   The relative or absolute partition pathname of the partition to run the
            application in, or a null string ("" or NULL in C, "" in Fortran) to use the
            default partition (the partition specified by $NX_DFLT_PART, or .compute if
            $NX_DFLT_PART is not set). The specified partition must exist and must
            give execute permission to the calling process.

            If the user specifies a partition with the -pn switch on the command line, it
            overrides the value of the partition parameter (unless you set the argc
            parameter to 0, as described later in this section).

            See “Partition Pathnames” on page 2-28 for more information on partition
            pathnames; see “Owner, Group, and Protection Modes” on page 2-32 for
            more information on partition permissions.

size        The size of the application (number of nodes to run the application on), or 0
            to use the default size (the size specified by $NX_DFLT_SIZE, or all nodes of
            the partition if $NX_DFLT_SIZE is not set).

            nx_initve() attempts to allocate a square group of nodes if it can. If this is not
            possible, it attempts to allocate a rectangular group of nodes that is either
            twice as wide as it is high or twice as high as it is wide. If this is not possible,
            it allocates any available nodes. In this case, nodes allocated to the application
            may not be contiguous.

            If the user specifies the -sz or -nd switch on the command line, it overrides
            the value of the size parameter (unless you set the argc parameter to 0, as
            described later in this section).

account     In the future, this parameter will be used for accounting information. For now,
            it must be a null string ("" or NULL in C, "" in Fortran).

argc        In C, a pointer to an integer whose value is the number of arguments on the
            command line (counting the application name). If the value of this integer is
            0, the command line is ignored. You can use a pointer to the argc parameter
            of main(), or you can construct the command line yourself.

            In Fortran, this parameter is any nonzero value to search the command line,
            or 0 to ignore the command line.

argv        In C, a pointer to the command-line arguments, which may include arguments
            that specify application characteristics. You can use the argv parameter of
            main(), or you can construct the command line yourself.

            In Fortran, nx_initve() gets the command line directly from the system,
            because Fortran programs don’t have an argv parameter. This parameter is
            ignored; it should always be 0.

            In either language, if any of the command-line arguments -sz size, -sz hXw,
            -nd hXw:n, -pri priority, -pn partition, -pkt packet_size,
            -mbf memory_buffer, -mex memory_export, -mea memory_each,
            -sth send_threshold, -sct send_count, -gth give_threshold, or -plk is found in
            the command line:

            • The appropriate application characteristic is set as specified by the
              argument.

            • The argument is removed from argv.

            • The variable pointed to by argc is decremented appropriately.

            Any remaining arguments are moved to the beginning of argv for your
            program’s use.

            Note that the arguments -pt type, -on nodelist, and \; application are not
            recognized by nx_initve(). If you want your application to have the same user
            interface as an application linked with -nx, you must examine the argument
            list for these arguments and pass the appropriate values to nx_load() or
            nx_loadve() yourself.
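
The way recognized switches are removed from argv can be sketched in standard C. This is an illustration of the behavior described above, not the code used by nx_initve(): strip_switches() and takes_value() are hypothetical names, and the sketch handles only -sz, -nd, -pri, -pn, and -plk.

```c
#include <stddef.h>
#include <string.h>

/* Switches in this sketch that are followed by one value argument. */
static int takes_value(const char *arg)
{
    return strcmp(arg, "-sz") == 0 || strcmp(arg, "-nd") == 0 ||
           strcmp(arg, "-pri") == 0 || strcmp(arg, "-pn") == 0;
}

/*
 * Remove recognized switches (and their values) from argv, move the
 * remaining arguments to the beginning, and decrement *argc to match.
 * argv[0], the application name, is always kept.
 */
void strip_switches(int *argc, char **argv)
{
    int in = 1;    /* next argument to examine */
    int out = 1;   /* next slot for an argument that is kept */

    while (in < *argc) {
        if (strcmp(argv[in], "-plk") == 0) {
            in += 1;                       /* switch with no value */
        } else if (takes_value(argv[in]) && in + 1 < *argc) {
            in += 2;                       /* switch plus its value */
        } else {
            argv[out++] = argv[in++];      /* not recognized: keep it */
        }
    }
    if (out < *argc)
        argv[out] = NULL;                  /* terminate the shortened list */
    *argc = out;
}
```

For the command line prog -sz 4 input.dat -plk -v, the sketch leaves argc equal to 3 with argv holding prog, input.dat, and -v.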

nx_initve() returns the number of nodes in the application, or -1 if any error occurs. 

For example, the following C call creates an application whose characteristics (partition, number of 
nodes, and so on) are determined by the user through command-line switches. If the user runs this 
program with no command-line switches, it runs on the user’s default number of nodes in the user’s 
default partition. 

#include <nx.h> 

main(int argc, char *argv[]) {
int n; 

n = nx_initve("", 0, "", &argc, argv); 

After this call, the variable n contains the number of nodes in the new application, or -1 if any error
occurred. The variable argc contains the count of arguments that were not recognized and removed
by nx_initve(), and the array argv contains pointers to those arguments.
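
Because nx_initve() leaves arguments such as -pt in argv, a program that wants to honor them must scan the remaining arguments itself. The following is a minimal sketch in standard C under that assumption; find_ptype() is a hypothetical helper, and the default of 0 matches the default process type used by -nx.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look for "-pt type" among the remaining arguments and return the
 * process type, or default_ptype if the switch is absent or malformed.
 */
long find_ptype(int argc, char **argv, long default_ptype)
{
    int i;

    for (i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-pt") == 0) {
            char *end;
            long value = strtol(argv[i + 1], &end, 10);

            /* Accept the value only if the whole argument is a number. */
            if (end != argv[i + 1] && *end == '\0')
                return value;
            return default_ptype;
        }
    }
    return default_ptype;
}
```

The returned value could then be passed as the ptype parameter of nx_nfork(), nx_load(), or nx_loadve().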

The following Fortran call creates an application on 50 nodes of the partition mypart, ignoring any
command-line switches provided by the user:

include 'fnx.h'
integer n

n = nx_initve("mypart", 50, "", 0, 0)

After this call, the variable n contains the number of nodes in the new application, or -1 if any error 
occurred. 

The following restrictions apply to nx_initve():

• A single process cannot call nx_initve() more than once.

• An application that calls nx_initve() cannot be linked with -nx. You must use -lnx instead.

• If your application uses any signal handlers, you must set them up after the call to nx_initve().
See signal() in the OSF/1 Programmer’s Reference for more information on signal handlers.

The reason you cannot use -nx when you link an application that calls nx_initve() is that the code
linked in by -nx calls nx_initve() itself, and nx_initve() can only be called once in an application.
If you do use -nx when you link, your application’s call to nx_initve() (actually the second call to
nx_initve()) fails and returns -1.


Creating a Rectangular Application with nx_initve_rect()

nx_initve_rect() works exactly like nx_initve() except that it requires that the nodes allocated to the
application form a rectangle with a particular height and width. Optionally, it can also specify the
rectangle’s location within the partition. The parameters of nx_initve_rect() are the same as those
of nx_initve(), except that instead of the size parameter it has the following three parameters:

anchor      The node number within the partition for the upper left corner of the
            rectangle, or -1 to allow the rectangle to be placed anywhere in the partition
            it will fit.

rows        The height of the rectangle.

cols        The width of the rectangle.

If the specified rectangle of nodes is not available, the nx_initve_rect() call fails and returns -1 (even
if the equivalent number of nodes is available with a non-rectangular shape).
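
The geometry implied by anchor, rows, and cols can be checked with simple arithmetic. The following sketch in standard C rests on an assumption that is not stated in this section: node numbers within a partition start at 0 in the upper left corner and increase row by row. rect_fits() and rect_node() are hypothetical helpers, and they test only whether the rectangle lies inside the partition, not whether its nodes are free.

```c
/*
 * Return 1 if a rows x cols rectangle anchored at node number anchor lies
 * inside a partition of part_rows x part_cols nodes, else 0.
 * An anchor of -1 means "anywhere it will fit".
 */
int rect_fits(long anchor, long rows, long cols, long part_rows, long part_cols)
{
    long top, left;

    if (rows <= 0 || cols <= 0 || part_rows <= 0 || part_cols <= 0)
        return 0;
    if (anchor == -1)
        return rows <= part_rows && cols <= part_cols;
    if (anchor < 0 || anchor >= part_rows * part_cols)
        return 0;

    top = anchor / part_cols;    /* row of the upper left corner */
    left = anchor % part_cols;   /* column of the upper left corner */
    return top + rows <= part_rows && left + cols <= part_cols;
}

/* Partition node number of the node at (row, col) inside the rectangle. */
long rect_node(long anchor, long row, long col, long part_cols)
{
    return anchor + row * part_cols + col;
}
```

Under these assumptions, a rectangle 10 nodes high and 20 nodes wide anchored at node 0 fits in a partition 16 nodes high and 32 nodes wide, but the same rectangle anchored at node 20 does not, because it would extend past the right edge.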

NOTE

All the restrictions and cautions in this manual that refer to
nx_initve() also apply to nx_initve_rect().


If the user specifies a size or shape with the -sz or -nd switch on the command line, it overrides the
values of these three parameters (unless you set the argc parameter to 0). nx_initve_rect() never
uses the environment variable $NX_DFLT_SIZE.

For example, the following Fortran call creates an application 8 nodes high and 8 nodes wide (unless
otherwise specified by command-line switches) anywhere it will fit in the user’s default partition.
The argc parameter is nonzero so that the command line is searched:

include 'fnx.h'
integer n

n = nx_initve_rect("", -1, 8, 8, "", 1, 0)

The following C call creates an application 10 nodes high and 20 nodes wide whose upper left corner
is node 0 (the upper left corner of the partition) in the partition mypart, ignoring any command-line
switches provided by the user:

#include <nx.h> 
int n, i; 

i = 0; 

n = nx_initve_rect("mypart", 0, 10, 20, "", &i, NULL); 

After either of these calls, the variable n contains the number of nodes in the new application, or -1 
if any error occurred. 

Note that nx_initve_rect() will fail if the exact specified rectangle is not available. If you just want
to find out the application’s size and shape, rather than mandating a particular size and shape, you
can use an ordinary nx_initve(), followed by a call to nx_app_rect() (discussed under “Finding an
Application’s Shape with nx_app_rect()” on page 4-16) to determine the height and width assigned
by nx_initve().

Setting an Application’s Priority with nx_pri()

nx_pri() sets the specified application’s priority to the specified value. If you don’t call nx_pri() and
the user doesn’t use the -pri switch, the default priority is 5. The parameters of nx_pri() have the
following meanings:

pgroup      The process group ID of the application (see “Process Groups” on page 4-22
            for more information), or 0 to specify the application of the calling process.
            If the specified process group ID is not the process group ID of the calling
            process, the calling process’s user ID must either be root or the same user ID
            as the specified application.

priority    The new priority, an integer from 0 to 10 inclusive. 0 is the lowest priority,
            10 is the highest.

nx_pri() returns 0, or -1 if any error occurs. 

For example, the following Fortran call sets the priority of the calling application to 7: 

include 'fnx.h' 
integer n 

n = nx_pri(0, 7) 

The following C call sets the priority of the application with process group ID 738423 to 0: 

#include <nx.h> 

int n; 

n = nx_pri(738423, 0); 

In each of these examples, the variable n is assigned 0, or -1 if any error occurred. 


4-9 



Using Other Paragon™ OSF/1 System Calls 


Paragon™ User’s Guide 


Copying a Process onto the Nodes with nx_nfork()


nx_nfork() copies the process that calls it onto the specified set of nodes with the specified process
type. It creates one child process on each specified node. nx_nfork() is similar to the standard OSF/1
call fork() except that it can fork processes onto multiple nodes and specifies the process type for
the child processes. The parameters of nx_nfork() have the following meanings:

node_list   An array of integers, each of which specifies a node number within the
            application (no node number may appear more than once in this array). The
            calling process is copied onto each of the specified nodes.

numnodes    The number of node numbers in node_list, or -1 to use all the nodes in the
            application (in which case node_list is ignored).

ptype       The process type for each child process.

pid_list    An array of integers, into which are stored the OSF/1 process identifiers
            (PIDs) of the child processes. See “Using PIDs” on page 4-14 for more
            information.


nx_nfork() returns the number of child processes created to the parent process and 0 to each child
process, or -1 if any error occurs.

For example, the following C calls create an application whose characteristics are specified by the 
user, then copy the calling process onto all nodes of the application. The process type of each child 
process is set to 0. 

#include <nx.h>
#include <sys/types.h>

main(int argc, char *argv[]) {
    int n;
    pid_t pids[2000];

    n = nx_initve("", 0, "", &argc, argv);
    n = nx_nfork(NULL, -1, 0, pids);

Note that the node_list argument is ignored when the numnodes argument is -1, so you can specify
a NULL pointer in this case (in Fortran, you can use the value 0). After the call to nx_nfork(), the
variable n contains the number of child processes created, or -1 if any error occurred; the first n
elements of the array pids contain the PIDs of the child processes. If more than 2000 child processes
are created, unexpected results will occur.

The following Fortran calls create an application with 100 nodes and copy the calling process onto
the first 50 nodes of the application (nodes 0 through 49). The process type of each child process is 
set to 0. 

      include 'fnx.h'
      integer n
      integer nodes(50), pids(50)

      n = nx_initve("mypart", 100, "", 0, 0)

      do 2, i = 1, 50
         nodes(i) = i - 1
    2 continue

      n = nx_nfork(nodes, 50, 0, pids)

After the call to nx_nfork(), the variable n contains 50, or -1 if any error occurred; the array pids
contains the PIDs of the child processes. 


Loading a Program onto the Nodes with nx_load()

nx_load() executes the specified file on the specified set of nodes with the specified process type.
Like nx_nfork(), nx_load() creates one child process on each specified node. The parameters of
nx_load() have the following meanings:

node_list     An array of integers, each of which specifies a node number within the
              application (no node number may appear more than once in this array). The
              specified file is loaded onto each of the specified nodes.

numnodes      The number of node numbers in node_list, or -1 to use all the nodes in the
              application (in which case node_list is ignored).

ptype         The process type for each child process.

pid_list      An array of integers, into which are stored the OSF/1 process identifiers
              (PIDs) of the child processes. See “Using PIDs” on page 4-14 for more
              information.

pathname      The relative or absolute pathname of the file to load.


nx_load() returns the number of child processes created, or -1 if any error occurs. 

For example, the following Fortran calls create an application whose characteristics are specified by
the user, then load and start the program myprog on all nodes of the application. The process type of
each child process is set to 0.

      include 'fnx.h'
      integer n
      integer pids(2000)

      n = nx_initve("", 0, "", 1, 0)
      n = nx_load(0, -1, 0, pids, "myprog")

After the call to nx_load(), the variable n contains the number of child processes created, or -1 if any
error occurred; the first n elements of the array pids contain the PIDs of the child processes. If more
than 2000 child processes are created, unexpected results will occur.

The following C calls create an application with 10 nodes in the partition mypart, then load and start
the program ../bin/myprog on nodes 1, 5, and 7 of the application. The process type of each child
process is set to 1.

#include <nx.h>
#include <sys/types.h>

int n, i;
int nodes[3];
pid_t pids[3];

i = 0;
n = nx_initve("mypart", 10, "", &i, NULL);

nodes[0] = 1;
nodes[1] = 5;
nodes[2] = 7;

n = nx_load(nodes, 3, 1, pids, "../bin/myprog");

After the call to nx_load(), the variable n contains 3, or -1 if any error occurred; the array pids
contains the PIDs of the child processes. 

Loading a Program onto the Nodes with nx_loadve()

nx_loadve() is just like nx_load() except that it also lets you specify the argument list and
environment variables for the new processes (in C). nx_loadve() has the following additional
parameters:

argv          In C, this parameter contains the command line for the child process (you can
              use the argv parameter of main() or construct the command line yourself).

envp          In C, this parameter contains the environment variables for the child process
              (you can use the envp parameter of main() or construct the environment
              yourself).

In Fortran, you must specify the value 0 for the argv and envp parameters (or use nx_load() instead).
This is necessary because these parameters are pointers to arrays of strings, which have no
equivalent in Fortran.

nx_loadve() returns the number of child processes created, or -1 if any error occurs. If an error
occurs, the value 0 is also stored into the pid_list for each process that was not successfully started.

For example, the following C calls create an application as specified by the user (if not specified, the
default number of nodes in the default partition), then set the value of the environment variable
HOME to /tmp, then load and start the program myprog on all nodes of the application with process
type 0:

#include <nx.h>
#include <stdlib.h>
#include <sys/types.h>

extern char **environ;

main(int argc, char *argv[]) {
    int n;
    pid_t pids[2000];

    n = nx_initve(NULL, 0, NULL, &argc, argv);
    putenv("HOME=/tmp");
    n = nx_loadve(NULL, -1, 0, pids, "myprog", argv, environ);

The argument list of myprog consists of any command-line arguments to the calling program that
were not recognized and removed by nx_initve(), and the environment of myprog is the same as the
user’s environment except for the value of HOME.

Waiting for Application Processes with nx_waitall()

nx_nfork(), nx_load(), and nx_loadve() return immediately to the calling process. To wait for the
processes created by nx_nfork(), nx_load(), or nx_loadve() to complete, call nx_waitall().
nx_waitall() blocks until all the child processes of the calling process have terminated. It
returns 0, or -1 if any error occurs.

For example, the following Fortran calls create a new application as specified by the user, run the 
program myprog on all nodes of the application, and wait until all the node processes have 
completed: 

      include 'fnx.h'
      integer n
      integer pids(2000)

      n = nx_initve("", 0, "", 1, 0)
      n = nx_load(0, -1, 0, pids, "myprog")
      n = nx_waitall()
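
The blocking behavior described for nx_waitall() can be pictured with standard calls only. The following C sketch is an analogy, not the implementation of nx_waitall(): it uses fork() and wait() on a single node, and the function name spawn_and_wait_all() is a hypothetical helper invented for this illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork nchildren processes that exit immediately, then block until every
 * child of the caller has terminated. Returns the number of children
 * reaped, or -1 if fork() or wait() fails unexpectedly.
 */
int spawn_and_wait_all(int nchildren)
{
    int i, reaped = 0;

    for (i = 0; i < nchildren; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(0);           /* child: nothing to do in this sketch */
    }

    /* wait() fails with ECHILD once no children remain */
    for (;;) {
        if (wait(NULL) == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: try again */
            return errno == ECHILD ? reaped : -1;
        }
        reaped++;
    }
}
```

The loop ends only when there are no children left, which is the same condition the text gives for nx_waitall() to return.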


Using PIDs 

The pid_list argument of nx_nfork(), nx_load(), and nx_loadve() receives the OSF/1 process
identifiers (PIDs) of the child processes created by the call. The specified array must be large enough
to hold all the PIDs; that is, it must have at least as many elements as the maximum number of
processes that could be created by the call. If more child processes are created than the number of
elements in the pid_list, unexpected results will occur (the program will probably crash).

In the typical case where you create one process per node of the application, you can use the value
returned by nx_initve() to determine the number of nodes in the application, then use malloc() or an
equivalent call to dynamically allocate a pid_list with the same number of elements. For example,
the following code allocates the appropriate number of elements to the array pids based on the
application size specified by the user in argv:

#include <nx.h> 

#include <stdio.h> 

#include <malloc.h> 


main(int argc, char **argv) { 
int nnodes; 
long *pids; 


nnodes = nx_initve(NULL, 0, NULL, &argc, argv); 
pids = (long *)calloc(nnodes, sizeof(long)); 
nx_nfork(NULL, -1, 0, pids); 

If you don’t use dynamic allocation, you should give the pid_list as many elements as the number of
nodes on the largest system on which the application will be run. For portability to very large Intel
supercomputers, this array should have at least 1000 elements (and possibly more in the future).

Each element in the pid_list receives the PID of the process on the node specified by the
corresponding element of the node_list. If numnodes is -1, the PID of the process on node 0 is stored
into the first element of pid_list, the PID of the process on node 1 is stored into the second element
of pid_list, and so on. If one or more processes were not successfully started, the value 0 is stored
into the corresponding element of the pid_list.
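
Because a 0 entry marks a process that was not started, you can scan the pid_list after the call to find out how many processes failed and where. The following C fragment is a minimal sketch of such a scan; count_failed_starts() is a hypothetical helper name, not a Paragon OSF/1 call.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Count the entries of a pid_list that are 0, that is, the processes that
 * were not successfully started. If first_failed is not NULL, it receives
 * the index of the first such entry, or -1 if every entry is nonzero.
 */
int count_failed_starts(const pid_t *pid_list, int nentries, int *first_failed)
{
    int i, failed = 0;

    if (first_failed != NULL)
        *first_failed = -1;
    for (i = 0; i < nentries; i++) {
        if (pid_list[i] == 0) {
            if (failed == 0 && first_failed != NULL)
                *first_failed = i;
            failed++;
        }
    }
    return failed;
}
```

When numnodes is -1, the index stored in first_failed is also the node number of the process that failed; otherwise it is an index into the node_list.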


NOTE 

The PIDs stored into the pid_list are OSF/1 PIDs, not Paragon
OSF/1 process types.


OSF/1 PIDs are unique throughout the system; they are used with standard OSF/1 system calls such
as kill(). (Note that kill() and other system interface routines are supported by the Fortran runtime
library; see the Paragon™ Fortran Compiler User’s Guide for information on these routines.)
Paragon OSF/1 process types are unique only within a single application and a single node; they are 
used with Paragon OSF/1 message-passing calls such as csend(). 

For example, the following C calls create an application as specified by the user, run the program 
myprog on all nodes of the application with process type 0, and then send the signal SIGKILL to 
all the node processes: 

#include <nx.h> 

#include <signal.h> 

#include <sys/types.h> 

main(int argc, char *argv[]) { 
int n, i; 
pid_t pids[2000]; 

n = nx_initve(NULL, 0, NULL, &argc, argv); 
n = nx_load(NULL, -1, 0, pids, "myprog");

for(i=0; i<n; i++) { 

kill(pids[i], SIGKILL); 

} 

Getting Information About Applications


Synopsis                                        Description

nx_app_rect(rows, cols)                         Obtain the height and width of the rectangle of
                                                nodes allocated to the current application.

nx_app_nodes(pgroup, node_list, list_size)      List the nodes allocated to an application.

nx_pspart(partition, pspart_list, list_size)    Obtain information about all applications and
                                                active subpartitions in a partition (C only).


To get information about applications once they are running, use nx_app_rect(), nx_app_nodes(), 
and nx_pspart(). nx_app_rect() returns the application’s shape (height and width of the rectangle 
of nodes allocated to the application); nx_app_nodes() returns a list of the nodes that are allocated 
to the application; and nx_pspart() returns information about all the active applications and 
subpartitions in a partition (like the pspart command). 


NOTE 

Do not call nx_app_nodes() or nx_pspart() on more than a few
nodes at once. 


If many nodes use the application information calls at the same time, the allocator daemon can 
become overwhelmed with requests, which could slow down your application or reduce system 
stability. If all the nodes in your application need this information, you should have one node make 
the call and then distribute the information to the other nodes. Note, though, that nx_app_rect() is
not subject to this restriction. 


Finding an Application’s Shape with nx_app_rect()

Sometimes, in addition to its node number and the number of nodes in the application, a process
needs to know the shape of the application. For example, an application might use a different
message-passing algorithm depending on whether the nodes allocated to the application form a
square, a tall skinny rectangle, a short wide rectangle, or something else (such as a group of
noncontiguous nodes).

To find out the rectangular dimensions of the nodes allocated to the application, call nx_app_rect().
nx_app_rect() stores the height of the application into rows and the width of the application into
cols. If the nodes allocated to the application do not form a rectangle, nx_app_rect() stores 1 into
rows and numnodes() into cols. nx_app_rect() returns 0 if it is successful, or -1 if any error occurs.

For example, the following code fragment in Fortran stores the height of the application into y and
the width of the application into x.

      integer*4 x, y, result

      result = nx_app_rect(y, x)

The following code fragment in C does the same:

long x, y, result;

result = nx_app_rect(&y, &x);

See “Specifying a Rectangle of Nodes” on page 2-16 for information on how to run your application 
on a rectangular group of nodes with a specific shape. 
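
Once nx_app_rect() has stored the dimensions, a program can choose an algorithm by comparing them. The following C fragment is a minimal sketch of that comparison; the enum app_shape and the function classify_shape() are hypothetical names invented for this illustration.

```c
/* Shape categories named in the text; the enum itself is hypothetical. */
enum app_shape { SHAPE_SQUARE, SHAPE_TALL, SHAPE_WIDE };

/*
 * Classify the dimensions stored by nx_app_rect(). rows is the height and
 * cols is the width. A 1 x N result is reported as SHAPE_WIDE even though
 * it may really mean "not a rectangle"; the two cases cannot be told apart
 * from rows and cols alone.
 */
enum app_shape classify_shape(long rows, long cols)
{
    if (rows == cols)
        return SHAPE_SQUARE;
    return rows > cols ? SHAPE_TALL : SHAPE_WIDE;
}
```

For example, after result = nx_app_rect(&y, &x) succeeds, classify_shape(y, x) reports which of the three cases applies.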


NOTE 

nx_app_rect() can also be called by the name mypart() for
compatibility with the Touchstone DELTA System. 


Listing an Application’s Nodes with nx_app_nodes()

Occasionally you want to know an application’s physical location within the system. You can use 
this information to help track down possible hardware problems or make use of nodes with special 
hardware features (such as extra memory or special I/O interfaces). 

To list the nodes allocated to an application, call nx_app_nodes(). nx_app_nodes() has the
following parameters:

pgroup        The process group ID of the application (see “Process Groups” on page 4-22
              for more information), or 0 to specify the application of the calling process.
              If the specified process group ID is not the process group ID of the calling
              process, the calling process’s user ID must either be root or the same user ID
              as the specified application.

node_list     Pointer variable into which nx_app_nodes() stores the address of the list of
              nodes. The call allocates the memory for this list; when you are finished using
              the information, you should release this memory by calling free().

list_size     Variable into which nx_app_nodes() stores the number of entries in
              node_list.

The node numbers returned by nx_app_nodes() are node numbers from the root partition, which tell 
you where in the machine the application is located. nx_app_nodes() returns 0 for success, or -1 if 
any error occurs. 

For example, the following Fortran program fragment prints the root node numbers of the nodes on 
which the current application is running: 

      include 'fnx.h'

      integer*4 mynodes(1)
      pointer (ptr, mynodes)
      integer nnodes
      integer i, status

      status = nx_app_nodes(0, ptr, nnodes)
      if(status .ne. 0) then
         call nx_perror("nx_app_nodes()")
         stop
      end if

      do 2, i = 1, nnodes
         print *, mynodes(i)
    2 continue

      call free(ptr)

The equivalent C code is as follows:

#include <nx.h>
#include <stdio.h>
#include <stdlib.h>

nx_nodes_t mynodes;
unsigned long nnodes;
int i, status;

status = nx_app_nodes(0, &mynodes, &nnodes);
if(status != 0) {
    nx_perror("nx_app_nodes()");
    exit(1);
}

for(i = 0; i < nnodes; i++) {
    printf("%d\n", mynodes[i]);
}

free(mynodes);

Note the use of the & operator on the variables mynodes and nnodes in the call to nx_app_nodes().


Listing the Applications in a Partition with nx_pspart()

nx_pspart() returns information about each of the applications and subpartitions in a partition, like 
the pspart command. It is callable only from C, not Fortran. It has the following parameters: 

partition     The relative or absolute pathname of the partition. The specified partition
              must exist and must give read permission to the calling process.

pspart_list   Pointer variable into which nx_pspart() stores the address of an array of
              nx_pspart_t structures. Each structure in the array describes one object
              (application or subpartition). The nx_pspart_t structure is defined in
              allocsys.h, which is automatically included by nx.h and fnx.h. It includes the
              following fields:

              object_type     The type of the object described by this structure:
                              NX_APPLICATION or NX_PARTITION. (These
                              are constants defined in nx.h or fnx.h.)

              object_id       If the object is an application, this is its process group
                              ID. If the object is a partition, this is an arbitrary value
                              and should be ignored.

              uid             The numeric user ID of the object’s owner.

              gid             The numeric group ID of the object’s group.

              size            The number of nodes allocated to the object.

              priority        The current priority of the object.

              rolled_in       The amount of time the object has been rolled in
                              during the current rollin quantum, expressed as an
                              integer number of milliseconds.

              rollin_quantum  The rollin quantum for the object’s parent partition
                              (that is, the partition specified in the nx_pspart() call),
                              expressed as an integer number of milliseconds.

              elapsed         The total amount of time the object has been rolled in
                              since it was started, expressed as an integer number of
                              milliseconds.

              active          Whether or not the object is currently rolled in: 1 if it
                              is, 0 if it is not.

              time_started    The time the object was started, as returned by time().
                              (If the object is a subpartition, the time the oldest
                              application in the subpartition was started.)

              nx_pspart() allocates the memory for the pspart_list array; when you are
              finished using the information, you should release this memory by calling
              free().

list_size     Variable into which nx_pspart() stores the number of nx_pspart_t structures
              in pspart_list.

nx_pspart() returns 0 for success, or -1 if any error occurs. 

For example, the following program fragment prints the numeric user ID and size for every
application and subpartition in the partition mypart.

#include <nx.h>
#include <stdio.h>
#include <stdlib.h>

nx_pspart_t *info;
unsigned long nobjs;
int status, i;

status = nx_pspart("mypart", &info, &nobjs);
if(status != 0) {
    nx_perror("nx_pspart()");
    exit(1);
}

for(i = 0; i < nobjs; i++) {
    printf("uid = %d, size = %d\n", info[i].uid, info[i].size);
}

free(info);

Note the use of the & operator on the pointer info and the variable nobjs in the call to nx_pspart().
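
The fields described above can be combined to answer simple questions about a partition, such as how many nodes belong to applications that are currently rolled in. The following C fragment is a sketch of that calculation under stated assumptions: so that it can be compiled without nx.h, it uses a stand-in structure, struct demo_pspart, and stand-in constants DEMO_APPLICATION and DEMO_PARTITION, all of which are hypothetical and carry only three of the fields listed above.

```c
#include <stddef.h>

/* Stand-ins for this sketch only; real programs get these from nx.h. */
#define DEMO_APPLICATION 1
#define DEMO_PARTITION   2

struct demo_pspart {
    int object_type;    /* DEMO_APPLICATION or DEMO_PARTITION */
    int size;           /* number of nodes allocated to the object */
    int active;         /* 1 if currently rolled in, 0 if not */
};

/* Sum the sizes of the applications that are currently rolled in. */
int active_app_nodes(const struct demo_pspart *list, size_t list_size)
{
    size_t i;
    int total = 0;

    for (i = 0; i < list_size; i++) {
        if (list[i].object_type == DEMO_APPLICATION && list[i].active == 1)
            total += list[i].size;
    }
    return total;
}
```

With the real structure, the same loop would run over the array returned through pspart_list, using list_size as the bound.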


The Controlling Process 

By calling nx_initve(), a process creates a new application. The process that called nx_initve() 
becomes the new application’s controlling process. Each application has exactly one controlling 
process, and each process controls at most one application. 

The controlling process is a special process that creates and controls the application: 

• The controlling process can create new processes in the application, using the Paragon OSF/1
  function nx_nfork(), nx_load(), or nx_loadve().

• The controlling process can wait for an application process to complete, using nx_waitall() or
  the standard OSF/1 function wait() or waitpid().

• The controlling process can send a signal to an application process, or terminate it, using the 
standard OSF/1 function kill(). In particular, the controlling process can send a signal to all the 
processes in the application (including itself) by using kill(0, signal). 

You can terminate the entire application by terminating the controlling process, using the kill
command or your interrupt key (normally <ctrl-c> or <Del>). The controlling process always
runs in the service partition; the application processes run in the partition specified by nx_initve().
If the application processes are running in a gang-scheduled partition, the controlling process is
rolled in and out along with its application (that is, when the application is rolled out, the controlling
process gets no processor time; when the application is rolled in, the controlling process gets its
normal share of the service partition’s processor time).

In OSF/1 terms, the controlling process is a parent process and the processes created by nx_nfork(),
nx_load(), or nx_loadve() are its child processes. (In this respect, nx_nfork() is similar to fork(),
nx_load() is similar to a fork() followed by an execv() with a null argument list, and nx_loadve()
is similar to a fork() followed by an execve().) The controlling process and the application processes
all belong to the same process group, and the controlling process is the process group leader of the 
group. No process outside the application belongs to this process group. 

The controlling process does not usually do heavy computational work, because it runs in the service 
partition along with users’ shells and other interactive processes. Since it is scheduled interactively, 
the controlling process will not give as much throughput as application processes running in 
gang-scheduled compute partitions. 

See the OSF/1 Programmer’s Reference for information on wait(), waitpid(), kill(), fork(), and
execve().


Process Groups 

Process groups are a standard OSF/1 concept, not unique to Paragon OSF/1. A process group is a 
set of related processes. You can send a signal to all the processes in a group at once with kill(), and
you can wait for any process in a group with waitpid(). The processes in a process group also share
access to a terminal, called the controlling terminal of the group. Each process belongs to exactly 
one process group. 

The processes in a process group are all children (or grandchildren, and so on) of the oldest process 
in the group, called the process group leader. The process group leader’s process ID (PID) is used 
to identify the group, and is also called the process group ID of the whole group. (Note that this is 
the process group leader’s OSF/1 PID, not its process type.) A process can determine its process 
group ID by calling getpgrp(). 

Normally, a process belongs to the same process group as its parent process. However, a process can 
leave its parent’s process group and start a new process group of its own by making such calls as 
setpgid(), setpgrp(), or setsid(). These calls create a new process group, then remove the calling
process from its current group and place it in the new group. The calling process becomes the new 
group’s process group leader, and the caller’s PID becomes the new group’s process group ID. After 
that, any processes created by the process group leader belong to the new process group. See the 
OSF/1 Programmer’s Reference for information on setpgid() and getpgrp().
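
The two cases described above, a child that stays in its parent’s process group and a child that starts a group of its own, can be observed with standard calls. The following C fragment is a minimal sketch that uses only fork(), getpgrp(), setpgid(), and waitpid(); the function name child_shares_group() is a hypothetical helper invented for this illustration, and no Paragon OSF/1 calls are involved.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child and report whether it is in the parent's process group.
 * If leave_group is nonzero, the child first calls setpgid(0, 0) to start
 * a new group of its own. Returns 1 if the child shares the parent's
 * group, 0 if it does not, or -1 on error.
 */
int child_shares_group(int leave_group)
{
    pid_t parent_group = getpgrp();
    pid_t pid = fork();
    int status;

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (leave_group && setpgid(0, 0) == -1)
            _exit(2);                               /* setpgid() failed */
        _exit(getpgrp() == parent_group ? 1 : 0);   /* report via exit code */
    }
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

child_shares_group(0) shows the normal case, in which the child inherits the group; child_shares_group(1) shows the child becoming the leader of a new group whose ID is the child’s own PID.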

Process Groups in Paragon™ OSF/1

In Paragon OSF/1, process groups work the same as they do in standard OSF/1. In addition,
nx_initve() makes sure that the calling process is a process group leader. If the calling process is not
already a process group leader, nx_initve() has the same effect as setpgid(): it creates a new process 
group and makes the calling process the new group’s process group leader. Because all the processes 
in the application are created by the controlling process, all the processes in an application are 
members of the same process group, and no other process in the system is a member of that process 
group. This means that the application’s process group ID uniquely identifies the application, which 
is why you use a process group ID to identify the application in nx_pri(). 

This also means that if a process in an application leaves the application’s process group by calling 
nx_initve() (or setpgid(), setpgrp(), or setsid()), it leaves the application. If a process leaves its
application’s process group, it is no longer considered part of the application and can no longer 
exchange messages with the other processes in the application. You shouldn’t do this unless you 
know exactly what you are doing. 


Killing Application Processes 

You can take advantage of the fact that all the processes in the application are members of the same 
process group by using OSF/1 system calls that affect process groups. For example, specifying a 
process ID of 0 (zero) to kill() sends the specified signal to all the members of the calling process’s
process group, so the following call kills the entire application (including the calling process): 

kill(0, SIGKILL); 

This call differs from the example discussed under “Using PIDs” on page 4-14 in that it also kills 
the calling process. 
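
To see the effect of kill(0, SIGKILL) without killing the program that runs the experiment, the call can be made from inside a separate process group. The following C fragment is a sketch that uses only standard calls (fork(), setpgid(), kill(), pause(), and waitpid()); the function name group_kill_demo() is a hypothetical helper invented for this illustration, and the processes it creates are ordinary processes rather than a Paragon OSF/1 application.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run kill(0, SIGKILL) inside a separate process group so that the caller
 * of this function survives. Returns 1 if the group leader was terminated
 * by SIGKILL, 0 if it ended some other way, or -1 on error.
 */
int group_kill_demo(void)
{
    int i, status;
    pid_t leader = fork();

    if (leader == -1)
        return -1;
    if (leader == 0) {
        if (setpgid(0, 0) == -1)    /* become leader of a new group */
            _exit(1);
        for (i = 0; i < 2; i++) {
            pid_t member = fork();  /* members inherit the new group */
            if (member == 0) {
                pause();            /* wait here until a signal arrives */
                _exit(0);
            }
        }
        kill(0, SIGKILL);           /* signals the leader and its members */
        _exit(2);                   /* not reached if the signal arrived */
    }
    if (waitpid(leader, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The leader plays the part of the controlling process: its own kill(0, SIGKILL) call terminates it along with every other member of its group.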


An Example Controlling Process 

The following C program (which must be linked with -lnx, not -nx) copies itself onto eight nodes of
the partition mypart with a process type of 0 and a priority of 7. The eight application processes print 
“Hello from node n” and then exit. The controlling process waits for the application processes to 
finish, then prints “Hello from controlling process” before exiting itself. Note that this program is 
executed by both the controlling process and the application processes. 

#include <nx.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

#define NUMNODES 8

main(int argc, char **argv) {
    int n, i;
    pid_t pids[NUMNODES];

    /* create application */
    n = nx_initve("mypart", NUMNODES, NULL, &argc, argv);
    if(n == -1) {
        /* nx_initve() failed */
        perror("nx_initve");
        exit(1);
    }

    /* set application priority to 7 */
    n = nx_pri(0, 7);   /* 0 specifies calling application */
    if(n == -1) {
        /* nx_pri() failed */
        perror("nx_pri");
        exit(1);
    }

    /* fork child processes onto all nodes of application */
    n = nx_nfork(NULL, -1, 0, pids);
    if(n == -1) {
        /* nx_nfork() failed */
        perror("nx_nfork");
        exit(1);
    } else if(n == 0) {
        /* child process: print "Hello" and exit */
        printf("Hello from node %d!\n", mynode());
        exit(0);
    } else {
        /* parent (controlling process): wait for all children */
        nx_waitall();

        /* now print "Hello" and exit */
        printf("Hello from controlling process!\n");
        exit(0);
    }
}

Message Passing Between Controlling Process and Application Processes


Synopsis        Description

myhost()        Obtain the controlling process’s node number.


A controlling process can exchange messages with its child processes using the Paragon OSF/1
message-passing calls described in Chapter 3.

• The controlling process’s node number is equal to numnodes(). (The maximum node number
  within the application is numnodes() - 1.) The controlling process’s node number is also
  returned by myhost() in any process in the application. In the controlling process, myhost(),
  mynode(), and numnodes() all return the same number.

• The controlling process’s process type is initially INVALID_PTYPE, but you can change it to
  a valid value by calling setptype(). For best performance, you should not call setptype() until
  after you have created all application processes with nx_nfork(), nx_load(), or nx_loadve(),
  and you should not call setptype() at all unless you need to exchange messages with application
  processes.

Although the controlling process can exchange messages with the application processes, it does not 
participate in global operations: 

• The controlling process may not make any of the calls described under “Global Operations” on 
page 3-27. 

• The controlling process does not participate when the application processes make any of the 
calls described under “Global Operations” on page 3-27. 

• The controlling process does not get messages sent to node number -1 (all nodes). 

A send to node -1 (all nodes) sends the message to all the nodes in the application (except the calling 
process’s node), but not the controlling process. This applies whether the message is sent by a node
process or by the controlling process itself. On the other hand, an extended receive that specifies 
node -1 (any node) as the sending node will match a message from the controlling process. 
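
The rule for a send to node -1 can be written down as a small predicate. The following C fragment is only a model of the rule as stated above, not part of the message-passing library; receives_send_to_all() is a hypothetical name invented for this illustration.

```c
/*
 * Model of the rule for a send to node -1. Node numbers 0 through
 * numnodes - 1 are application nodes; the controlling process has node
 * number numnodes. Returns 1 if `node` receives a message that `sender`
 * addresses to node -1, and 0 otherwise.
 */
int receives_send_to_all(int node, int sender, int numnodes)
{
    if (node < 0 || node >= numnodes)
        return 0;               /* controlling process or invalid number */
    return node != sender;      /* the sender's own node is skipped */
}
```

In an 8-node application, for example, a send to node -1 from the controlling process (node number 8) reaches nodes 0 through 7, while a send to node -1 from node 3 reaches every node except node 3 and never reaches the controlling process.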

Here is an application, written in Fortran, that demonstrates message-passing between the
controlling process and the application processes. This application multiplies two numbers (in a very
inefficient way). It consists of two programs, control.f and app.f. You must link control.f with -lnx,
not -nx; app.f can be linked with either -lnx or -nx.

The controlling process (control.f) requests a number of nodes and an integer value from the user. It
creates an application of the specified number of nodes on the partition mypart and loads the 
program app onto each node. It then sends the user’s integer value to each node as a message (note 
that the node number -1 sends to all nodes, not including the controlling process) and waits for a 
return message with the result. When the result is received, the controlling process prints its value 
and then exits. 


      include 'fnx.h'

      integer num_nodes, n, i
      integer nodes(128), pids(128)
      integer app_ptype
      parameter (app_ptype = 0)
      integer data, result
      integer result_type, data_type
      parameter (result_type = 1)
      parameter (data_type = 2)

c get number of nodes (limited to size of "nodes" and "pids" arrays)
    1 print *, "Enter number of nodes (must not be greater than 128)"
      read(*,*) num_nodes
      if(num_nodes .gt. 128) goto 1

c create application of specified size
      n = nx_initve("mypart", num_nodes, "", 0, 0)
      if(n .eq. -1) then
         call nx_perror("nx_initve")
         stop
      end if

c fill in node array for nx_load()
      do 2, i = 1, num_nodes
         nodes(i) = i - 1
    2 continue

c load program "app" onto the nodes of the application
      n = nx_load(nodes, num_nodes, app_ptype, pids, "app")
      if(n .eq. -1) then
         call nx_perror("nx_load")
         stop
      end if

c get an integer from the user
      print *, "Enter value to be summed"
      read(*,*) data

c set my process type
      call setptype(app_ptype)

c send integer to all the nodes
      call csend(data_type, data, 4, -1, app_ptype)

c receive the result
      call crecv(result_type, result, 4)

c print the result
      print *, "The sum of ", data, " over ", num_nodes,
     +         " nodes is ", result
      end


The application process (app.f) waits for a message and performs a gisum() on the value received.
(Note that the controlling process does not participate in the gisum().) The process on node 0 sends
the result to the controlling process, then all the application processes exit.

      include 'fnx.h'

      integer val, work
      integer result_type, data_type
      parameter (result_type = 1)
      parameter (data_type = 2)

c get an integer from the controlling process
      call crecv(data_type, val, 4)

c sum it over all nodes
      call gisum(val, 1, work)

c if I'm node 0, send the result back to the controlling process
      if(mynode() .eq. 0) call csend(result_type, val, 4, myhost(), 0)

      end


Managing Partitions 

Paragon OSF/1 provides system calls that let you create and remove partitions, get information about 
partitions, and change their characteristics, like the mkpart, rmpart, showpart, and chpart
commands described in Chapter 2. See “Managing Partitions” on page 2-25 for introductory 
information on partitions. 

Making Partitions

Synopsis                                        Description

nx_mkpart(partition, size, type)                Create a partition with a particular number of
                                                nodes.

nx_mkpart_rect(partition, rows, cols, type)     Create a partition with a particular height and
                                                width.

nx_mkpart_map(partition, numnodes,              Create a partition with a specific set of nodes.
              node_list, type)


To create a partition, use nx_mkpart(), nx_mkpart_rect(), or nx_mkpart_map(). These calls all
create a partition, but they use different methods to specify the nodes allocated to the new partition:

• nx_mkpart() works like the mkpart command’s -sz size switch.

• nx_mkpart_rect() works like the mkpart command’s -sz hXw switch.

• nx_mkpart_map() works like the mkpart command’s -nd nodespec switch (except that only
  node numbers can be specified).

See “Specifying the Nodes Allocated to the Partition” on page 2-40 for more information on these
switches.

These calls have the following parameters: 

partition The new partition’s relative or absolute pathname. The specified new 

partition must not exist; the parent partition of the specified new partition 
must exist and must give write permission to the calling process. See 
“Partition Pathnames” on page 2-28 for more information on partition 
pathnames; see “Owner, Group, and Protection Modes” on page 2-32 for 
more information on partition permissions. 

size The number of nodes of the new partition, or -1 to specify “all the nodes of 

the parent partition.” If you specify a size smaller than that of the parent 
partition, the nodes are selected by the system (and are not necessarily 
contiguous). 

nx_mkpart() attempts to allocate a square group of nodes if it can. If this is 
not possible, it attempts to allocate a rectangular group of nodes that is either 
twice as wide as it is high or twice as high as it is wide. If this is not possible, 
it allocates any available nodes. In this case, nodes allocated to the partition 
may not be contiguous.

rows and cols The height and width of the new partition. The new partition is a rectangle
with the specified number of rows and columns, but its location within the
parent partition is selected by the system.

numnodes and node_list
The exact node numbers within the parent partition for the new partition. The
node_list parameter is an array of node numbers; the numnodes parameter
specifies the number of elements in node_list.

type The new partition’s scheduling type: NX_STD to specify standard
scheduling, NX_GANG to specify gang scheduling, or NX_SPS to specify
space sharing. The names NX_STD, NX_GANG, and NX_SPS are defined
in nx.h and fnx.h. See “Scheduling Characteristics” on page 2-33 for more
information on the different scheduling types.

nx_mkpart(), nx_mkpart_rect(), and nx_mkpart_map() return the number of nodes in the new
partition, or -1 if any error occurs.

The new partition’s owner and group are set to the owner and group of the calling process. All other
partition characteristics not specified in the call (such as protection modes and rollin quantum) are
set to the same values as the parent partition. Once the partition is created, you can use the
nx_chpart_...() calls to set these characteristics to different values, as discussed under “Changing
Partition Characteristics” on page 4-36.

For example, the following Fortran call creates a new gang-scheduled partition called newpart 
whose parent partition is the compute partition (using a relative partition pathname) and which 
consists of all the nodes in the compute partition: 

include 'fnx.h'
integer n 

n = nx_mkpart("newpart", -1, NX_GANG) 

The following C call creates a new space-shared partition called mypart whose parent partition is the 
compute partition (using an absolute partition pathname) and which has 54 nodes: 

#include <nx.h> 

int n; 

n = nx_mkpart(".compute.mypart", 54, NX_SPS); 

The following C call creates a new gang-scheduled partition called rect whose parent partition is 
mypart and which is 3 nodes high and 4 nodes wide: 

#include <nx.h> 

int n; 

n = nx_mkpart_rect(".compute.mypart.rect", 3, 4, NX_GANG);

The following C call creates a new space-shared partition called corners whose parent partition is
rect and which consists of the four nodes at the corners of rect:

#include <nx.h> 
long nodes[4]; 
int n; 

nodes[0] = 0;
nodes[1] = 3;
nodes[2] = 8;
nodes[3] = 11;

n = nx_mkpart_map(".compute.mypart.rect.corners", 4,
                  nodes, NX_SPS);

In each of these examples, the variable n is assigned the number of nodes in the new partition, or -1 
if any error occurred. 
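
The corner node numbers used in the last example (0, 3, 8, and 11 for a partition 3 nodes high and 4 nodes wide) follow from numbering the nodes of a rectangular partition row by row, starting at 0. The following sketch computes them for any rectangle. It assumes that row-by-row numbering, which is inferred from the example, and the function name corner_nodes() is hypothetical (it is not part of nx.h).

```c
/* Hypothetical helper, not part of nx.h.
 * Fills out[0..3] with the partition-relative node numbers of the four
 * corners of a rows x cols rectangle, assuming nodes are numbered
 * row by row starting at 0. */
void corner_nodes(long rows, long cols, long out[4])
{
    out[0] = 0;                    /* first node of the first row */
    out[1] = cols - 1;             /* last node of the first row  */
    out[2] = (rows - 1) * cols;    /* first node of the last row  */
    out[3] = rows * cols - 1;      /* last node of the last row   */
}
```

The resulting array could then be passed as the node_list argument of nx_mkpart_map(), with numnodes set to 4.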


Removing Partitions 

Synopsis Description 

nx_rmpart(partition, force, recursive)    Remove a partition.


To remove a partition, use nx_rmpart(). The parameters of nx_rmpart() have the following
meanings:

partition The relative or absolute pathname of the partition to be removed. The parent 

partition of the specified partition must give write permission to the calling 
process. See “Partition Pathnames” on page 2-28 for more information on 
partition pathnames; see “Owner, Group, and Protection Modes” on page 
2-32 for more information on partition permissions. 

force Specifies whether to remove the partition if it contains running applications: 

if force is 0, the partition will not be removed if it contains any applications; 
if force is any value other than 0, the partition will be removed even if it 
contains applications. 

recursive Specifies whether to remove the partition if it contains subpartitions: if 

recursive is 0, the partition will not be removed if it contains any 
subpartitions; if recursive is any value other than 0, the partition will be 
removed along with all its subpartitions, sub-subpartitions, and so on. This is 
an “all or nothing” operation: if any subpartitions cannot be removed, the call 
fails and no subpartitions are removed.

If the partition contains both subpartitions and applications, or contains subpartitions that contain
applications, you must set both force and recursive to a nonzero value to remove it.

nx_rmpart() returns 0 for success, or -1 if any error occurs. 
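
The way force and recursive combine can be summarized as a small predicate. The following sketch only models the rules described above; it is not how the system implements nx_rmpart(), it ignores permission checks, and the function and parameter names are hypothetical.

```c
/* Hypothetical model of the force/recursive rules, not a system call.
 *   has_apps     - nonzero if the partition, or any partition below it,
 *                  contains applications
 *   has_subparts - nonzero if the partition contains subpartitions
 * Returns 1 if the rules would permit removal, 0 otherwise. */
int removal_allowed(int has_apps, int has_subparts, int force, int recursive)
{
    if (has_apps && !force)
        return 0;   /* applications present and force is 0 */
    if (has_subparts && !recursive)
        return 0;   /* subpartitions present and recursive is 0 */
    return 1;
}
```

For instance, a partition whose subpartitions contain applications needs both flags set, which matches the rule stated above.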

For example, the following Fortran call removes the partition called newpart whose parent partition 
is the compute partition (using a relative partition pathname), but only if it does not contain any 
running applications or subpartitions: 

include 'fnx.h' 
integer n 

n = nx_rmpart("newpart", 0, 0) 

After this call, the variable n contains 0 if the partition was removed, or -1 if it was not removed for 
any reason (for example, if the partition contained applications or subpartitions). 

The following C call removes the partition called mypart whose parent partition is the compute 
partition (using an absolute partition pathname), even if it contains running applications; however, 
it does not remove mypart if the partition contains subpartitions: 

#include <nx.h> 
int n; 

n = nx_rmpart(".compute.mypart", 1, 0); 

After this call, the variable n contains 0 if the partition was removed, or -1 if it was not removed for 
any reason (for example, if the partition contained subpartitions, or if the partition does not exist). 


Getting Information About Partitions 


Synopsis Description 

nx_part_attr(partition, attributes)               Get a partition’s attributes.
nx_part_nodes(partition, node_list, list_size)    List the root node numbers for the nodes of a partition.


To get information about a partition, use nx_part_attr() or nx_part_nodes(). nx_part_attr()
returns the attributes of a partition, and nx_part_nodes() returns a list of the nodes in the specified
partition.


NOTE

Do not call nx_part_attr() or nx_part_nodes() on more than a few
nodes at once.


If many nodes use the partition information calls at the same time, the allocator daemon can become 
overwhelmed with requests, which could slow down your application or reduce system stability. If 
all the nodes in your application need this information, you should have one node make the call and 
then distribute the information to the other nodes.


Determining a Partition’s Attributes with nx_part_attr() 

nx_part_attr() returns the attributes of a partition. It has the following parameters:

partition The relative or absolute pathname of the partition. The specified partition
must exist and must give read permission to the calling process.

attributes A structure of type nx_part_info_t (you must allocate the space for this
structure). The nx_part_info_t structure is defined in allocsys.h, which is
automatically included by nx.h and fnx.h. It includes the following elements:

    uid               The numeric user ID of the partition’s owner.

    gid               The numeric group ID of the partition’s group.

    access            The access permissions of the partition, expressed as a
                      three-digit octal number.

    sched             The scheduling type of the partition: NX_STD,
                      NX_GANG, or NX_SPS. (These are constants
                      defined in nx.h or fnx.h.)

    rq                The rollin quantum of the partition, expressed as an
                      integer number of milliseconds (0 for a
                      standard-scheduled or space-shared partition).

    epl               The effective priority limit of the partition (20 for a
                      standard-scheduled partition).

    nodes             The number of nodes in the partition.

    mesh_x            The width of the partition (columns), or -1 if the
                      partition is not rectangular.

    mesh_y            The height of the partition (rows), or -1 if the partition
                      is not rectangular.

    enclose_mesh_x    The width of the smallest rectangle that completely
                      encloses the partition.

    enclose_mesh_y    The height of the smallest rectangle that completely
                      encloses the partition.

nx_part_attr() returns 0 for success, or -1 if any error occurs.

For example, the following C program fragment prints the rollin quantum and effective priority limit
for the partition mypart:

#include <stdio.h>
#include <stdlib.h>
#include <nx.h>

nx_part_info_t info;
int status;

status = nx_part_attr("mypart", &info);

if(status != 0) {
    nx_perror("nx_part_attr()");
    exit(1);
}

printf("rq = %d, epl = %d\n", info.rq, info.epl);

Note the use of the & operator on the structure info in the call to nx_part_attr(). The equivalent 
Fortran code is as follows: 

include 'fnx.h'

record /nx_part_info_t/ info
integer status

status = nx_part_attr("mypart", info)

if(status .ne. 0) then
    call nx_perror("nx_part_attr()")
    stop
end if

print *, "rq =",info.rq,", epl =",info.epl





Using Other Paragon™ OSF/1 System Calls 


Paragon™ User’s Guide 


If the partition is not a contiguous rectangle, the values of mesh_x and mesh_y are -1 and the
rectangle described by enclose_mesh_x and enclose_mesh_y includes nodes that are not part of the
partition. For example, Figure 4-1 shows a non-rectangular partition called mypart. For this
partition:

• nodes is 4.

• mesh_x and mesh_y are both -1.

• enclose_mesh_x is 3.

• enclose_mesh_y is 2.

Figure 4-1. Sample Partition for nx_part_attr() and nx_part_nodes()


Determining a Partition’s Nodes with nx_part_nodes()

nx_part_nodes() returns a list of the nodes in the specified partition. You might want to do this to
determine whether or not the partition includes a certain node which has special hardware
characteristics such as extra memory or an I/O interface. nx_part_nodes() has the following
parameters:

partition The relative or absolute pathname of the partition. The specified partition
must exist and must give read permission to the calling process.

node_list Pointer variable into which nx_part_nodes() stores the address of the list of
nodes. nx_part_nodes() allocates the memory for this list; when you are
finished using the information, you should release this memory by calling
free().

list_size Variable into which nx_part_nodes() stores the number of entries in
node_list.

nx_part_nodes() returns 0 for success, or -1 if any error occurs.

The node numbers returned by nx_part_nodes() are node numbers from the root partition. For
example, nx_part_nodes() for the partition mypart shown in Figure 4-1 would return node numbers
6, 7, 12, and 13. This is true even if the root partition is not the direct parent partition of mypart.
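
The relationship between a list of root node numbers and the enclosing rectangle reported by nx_part_attr() can be sketched as follows. This is only an illustration: the function name enclosing_rect() is hypothetical, and it assumes that root node numbers are assigned row by row starting at 0 and that the width of the root partition is known. With an assumed root width of 5 columns (a value chosen here because it is consistent with the numbers quoted for Figure 4-1, not one stated in this section), nodes 6, 7, 12, and 13 give an enclosing rectangle 3 wide and 2 high.

```c
/* Hypothetical helper, not part of nx.h.
 * Computes the smallest rectangle enclosing a set of root node numbers,
 * assuming row-by-row numbering on a root mesh root_cols columns wide. */
void enclosing_rect(const long *nodes, unsigned long n, long root_cols,
                    long *width, long *height)
{
    long min_r, max_r, min_c, max_c;
    unsigned long i;

    if (n == 0) {                     /* empty list: no rectangle */
        *width = 0;
        *height = 0;
        return;
    }

    min_r = max_r = nodes[0] / root_cols;
    min_c = max_c = nodes[0] % root_cols;

    for (i = 1; i < n; i++) {
        long r = nodes[i] / root_cols;   /* row of this node    */
        long c = nodes[i] % root_cols;   /* column of this node */
        if (r < min_r) min_r = r;
        if (r > max_r) max_r = r;
        if (c < min_c) min_c = c;
        if (c > max_c) max_c = c;
    }

    *width = max_c - min_c + 1;
    *height = max_r - min_r + 1;
}
```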

For example, the following Fortran program fragment prints the root node numbers for the partition 
mypart. 

include 'fnx.h' 

integer*4 mynodes(1) 

pointer (ptr, mynodes) 

integer nnodes 

integer i, status 

status = nx_part_nodes("mypart", ptr, nnodes)

if(status .ne. 0) then
    call nx_perror("nx_part_nodes()")
    stop
end if

do 2, i = 1, nnodes
    print *, mynodes(i)
2 continue

call free(ptr)

The equivalent C code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <nx.h>

nx_nodes_t mynodes;
unsigned long nnodes;
int i, status;

status = nx_part_nodes("mypart", &mynodes, &nnodes);

if(status != 0) {
    nx_perror("nx_part_nodes()");
    exit(1);
}

for(i = 0; i < nnodes; i++) {
    printf("%d\n", mynodes[i]);
}

free(mynodes);

Note the use of the & operator on the variables mynodes and nnodes in the call to nx_part_nodes().


Changing Partition Characteristics 


Synopsis                                     Description

nx_chpart_name(partition, name)              Change a partition’s name.
nx_chpart_mod(partition, mode)               Change a partition’s protection modes.
nx_chpart_epl(partition, priority)           Change a partition’s effective priority limit.
nx_chpart_rq(partition, rollin_quantum)      Change a partition’s rollin quantum.
nx_chpart_owner(partition, owner, group)     Change a partition’s owner and group.
nx_chpart_sched(partition, sched_type)       Change a partition’s scheduling type.


To change a partition’s characteristics, use nx_chpart_name(), nx_chpart_mod(), 
nx_chpart_epl(), nx_chpart_rq(), nx_chpart_owner(), or nx_chpart_sched(). Each of these calls
changes one characteristic, and leaves the other characteristics unchanged. These calls have the 
following parameters: 

partition The relative or absolute pathname of the partition to change. The specified 

partition must exist; the permissions required depend on the operation.

name (nx_chpart_name() only)

The new name for the partition, expressed as a string of any length containing 
only uppercase letters, lowercase letters, digits, and underscores. Note that 
nx_chpart_name() can only change the partition’s name “in place;” there is 
no way to move a partition to a different parent partition. 

The calling process must have write permission on the parent partition of the 
specified partition to use nx_chpart_name(). 
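
Because the new name may contain only letters, digits, and underscores, a program can check a candidate name before calling nx_chpart_name(). The following is a minimal sketch of such a check. The function valid_partition_name() is hypothetical (it is not part of nx.h), and rejecting an empty string is an assumption, since the text above does not say how an empty name is treated.

```c
#include <ctype.h>

/* Hypothetical helper, not part of nx.h.
 * Returns 1 if name consists only of letters, digits, and underscores,
 * 0 otherwise. An empty name is rejected (an assumption). */
int valid_partition_name(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    if (*p == '\0')
        return 0;

    for (; *p != '\0'; p++) {
        if (!isalnum(*p) && *p != '_')
            return 0;   /* for example, the '.' of a partition pathname */
    }
    return 1;
}
```

A partition pathname such as .compute.mypart fails this check because it contains periods.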

mode (nx_chpart_mod() only) 

The new protection modes of the partition, expressed as an octal number. See
chmod() in the OSF/1 Programmer’s Reference for more information on
specifying protection modes; see “Owner, Group, and Protection Modes” on
page 2-32 for more information on protection modes for partitions.

The calling process must be the owner of the partition or the system
administrator to use nx_chpart_mod().

priority (nx_chpart_epl() only) 

The new effective priority limit for the partition, expressed as an integer from 
0 to 10 inclusive. See “Scheduling Characteristics” on page 2-33 for more 
information on effective priority limits. 

The calling process must have write permission for the partition to use 
nx_chpart_epl(). 

rollin_quantum (nx_chpart_rq() only) 

The new rollin quantum for the partition, expressed as an integer number of 
milliseconds, or 0 to specify an “infinite” rollin quantum. The specified value 
must not be greater than 86,400,000 milliseconds (24 hours) and must not be 
less than the minimum rollin quantum for your system (determined by your 
system administrator). If it is not a multiple of 100, it is silently rounded up 
to the next multiple of 100. See “Scheduling Characteristics” on page 2-33 for 
more information on rollin quanta. 

The calling process must have write permission for the partition to use 
nx_chpart_rq(). 
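
The range and rounding rules for the rollin quantum can be sketched as a small function. This only models the rules stated above and is not the system's code: the function name normalize_rq() is hypothetical, the system minimum is passed in as a parameter because it is site-specific, and checking the range before rounding is an assumption.

```c
#define RQ_MAX_MS 86400000L   /* 24 hours, in milliseconds */

/* Hypothetical model of the rollin quantum rules, not a system call.
 * Returns the quantum that would take effect, in milliseconds
 * (0 means "infinite"), or -1 if the value would be rejected. */
long normalize_rq(long rq_ms, long min_rq_ms)
{
    if (rq_ms == 0)
        return 0;                          /* infinite rollin quantum */
    if (rq_ms < min_rq_ms || rq_ms > RQ_MAX_MS)
        return -1;                         /* outside the allowed range */
    return ((rq_ms + 99) / 100) * 100;     /* round up to a multiple of 100 */
}
```

For example, a requested quantum of 1250 milliseconds would take effect as 1300 milliseconds.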

owner and group (nx_chpart_owner() only)

The new user and group for the partition, expressed as a numeric user ID 
(UID) and group ID (GID). You can also specify -1, meaning “leave 
owner/group unchanged,” for either or both. See “Owner, Group, and 
Protection Modes” on page 2-32 for more information on partition 
ownership.

The permissions required for nx_chpart_owner() depend on the operation.
To change the partition’s ownership, the calling process must be the system
administrator. To change the partition’s group, the calling process must either
be the system administrator, or be the partition’s owner and be changing the
group to a group that the calling process belongs to.

sched_type (nx_chpart_sched() only)

The new scheduling type for the partition, which must be NX_GANG or 
NX_SPS (constants defined in nx.h or fnx.h). See “Scheduling 
Characteristics” on page 2-33 for more information on gang-scheduling and 
space sharing. 

The specified partition must not be standard-scheduled. A space-shared 
partition can be changed to gang-scheduled at any time; a gang-scheduled 
partition can only be changed to space-shared if it contains no applications 
and no overlapping subpartitions. 

The calling process must have write permission for the partition to use 

nx_chpart_sched(). 

nx_chpart_name(), nx_chpart_mod(), nx_chpart_epl(), nx_chpart_rq(), nx_chpart_owner(),
and nx_chpart_sched() return 0 for success, or -1 if any error occurs.

For example, the following Fortran call changes the name of mypart to newpart:

include 'fnx.h' 
integer n 

n = nx_chpart_name("mypart", "newpart") 

The following C call has the same effect, but uses an absolute partition pathname: 

#include <nx.h> 
int n; 

n = nx_chpart_name(".compute.mypart", "newpart"); 

Note that the second parameter of nx_chpart_name() is always a partition name, never a partition 
pathname. There is no way to move a partition from one parent partition to another. 

The following C call sets the permissions of mypart to rwxr-x--- (750 octal):

#include <nx.h> 
int n; 

n = nx_chpart_mod("mypart", 0750);

The following Fortran call has the same effect, but uses an absolute partition pathname:

include 'fnx.h'
integer n

n = nx_chpart_mod(".compute.mypart", '750'o)

The following C call sets mypart’s effective priority limit to 7:

#include <nx.h> 
int n; 

n = nx_chpart_epl("mypart", 7);

The following Fortran call sets mypart’s rollin quantum to 10 minutes (600,000 milliseconds):

include 'fnx.h'
integer n 

n = nx_chpart_rq("mypart", 600000) 

The following C calls set mypart’s owner to fred and its group to devel (see the OSF/1
Programmer’s Reference for information on getpwnam() and getgrnam(), which get the numeric
user and group IDs based on their names):

#include <stdio.h>
#include <pwd.h>
#include <grp.h>
#include <nx.h>

struct passwd *user;
struct group *group;
int n;

user = getpwnam("fred");
group = getgrnam("devel");

n = nx_chpart_owner("mypart", user->pw_uid, group->gr_gid);

The following Fortran call changes mypart to a gang-scheduled partition (it must currently be either 
gang-scheduled or space-shared): 

include 'fnx.h'
integer n 

n = nx_chpart_sched("mypart", NX_GANG) 

In each of these examples, the variable n is assigned 0 if the call succeeded, or -1 if any error 
occurred.


Listing Unusable Nodes

Synopsis                                  Description

nx_empty_nodes(node_list, list_size)      List the nodes that are empty slots.
nx_failed_nodes(node_list, list_size)     List the nodes that failed to boot.


To find out which nodes in the system are unusable, use nx_empty_nodes() and nx_failed_nodes().
(See “Unusable Nodes” on page 2-31 for more information on unusable nodes.)

• nx_empty_nodes() returns a list of the nodes that are part of the root partition but do not have
a node board installed in the corresponding slot (these are shown with their own symbol in the
output of showpart).

• nx_failed_nodes() returns a list of the nodes that are part of the root partition but failed to boot
for some reason (these are shown as “X” in the output of showpart).


NOTE 

Do not call nx_empty_nodes() or nx_failed_nodes() on more
than a few nodes at once.


If many nodes use these calls at the same time, the allocator daemon can become overwhelmed with 
requests, which could slow down your application or reduce system stability. If all the nodes in your 
application need this information, you should have one node make the call and then distribute the 
information to the other nodes. 

Both these calls have the following parameters: 

node_list Pointer variable into which the call stores the address of the list of nodes. The
call allocates the memory for this list; when you are finished using the
information, you should release this memory by calling free().

list_size Variable into which the call stores the number of entries in node_list.

The node numbers returned by these calls are node numbers from the root partition. Both calls return
0 for success, or -1 if any error occurs.

For example, the following Fortran program fragment prints the node numbers of all empty slots in
the root partition:

include 'fnx.h'

integer*4 empty(1)
pointer (ptr, empty)
integer nempty
integer i, status

status = nx_empty_nodes(ptr, nempty)

if(status .ne. 0) then
    call nx_perror("nx_empty_nodes()")
    stop
end if

do 2, i = 1, nempty
    print *, empty(i)
2 continue

call free(ptr)

The following C program fragment prints the node numbers of all nodes in the root partition that 
failed to boot: 

#include <stdio.h>
#include <stdlib.h>
#include <nx.h>

nx_nodes_t failed;
unsigned long nfailed;
int i, status;

status = nx_failed_nodes(&failed, &nfailed);

if(status != 0) {
    nx_perror("nx_failed_nodes()");
    exit(1);
}

for(i = 0; i < nfailed; i++) {
    printf("%d\n", failed[i]);
}

free(failed);

Note the use of the & operator on the variables failed and nfailed in the call to nx_failed_nodes().


Handling Errors


Synopsis              Description

_call()               Special version of call that returns error value to caller (C only).
nx_perror(string)     Print an error message corresponding to the current value of errno.


When an error occurs in a standard OSF/1 system call, the call indicates the error in one of two ways, 
depending on the error. For most errors, the call returns -1 and sets the variable errno to a value that 
describes the error. For certain severe errors (such as a segmentation violation caused by an invalid 
pointer parameter), the call sends a signal to the calling process; this signal may result in a core 
dump, as discussed under “Core Dumps” on page 4-44. 

When an error occurs in a Paragon OSF/1 system call whose name begins with nx_, it uses the same 
two techniques as a standard OSF/1 system call. However, when an error occurs in a Paragon OSF/1 
system call that is not a standard OSF/1 system call and whose name does not begin with nx_, the 
error is handled differently: the system prints a message on the terminal and terminates the calling 
process. (There are exceptions; see the manual page for each call in the Paragon™ C System Calls
Reference Manual or Paragon™ Fortran System Calls Reference Manual for details.) If you
program in C, you can get the same behavior as the nx_ calls by calling the underscore version of 
the call. (Fortran does not have underscore versions.) 


Underscore Calls 

The underscore version of a Paragon OSF/1 system call is the same as the standard version except
that it has an underscore added to the beginning of its name. For example, _crecv() is the underscore
version of crecv(). The underscore version returns -1 if the call encounters an error and 0 or a
positive value if the call is successful.

If an error occurs, the underscore version also sets the system variable errno to indicate the cause of
the error. The include file errno.h declares errno for you and defines constants for the possible errno
values. For example, if crecv() receives a message that is larger than the size specified by its len
parameter, an error message appears and the application terminates. If you use _crecv() instead, this
does not occur; instead, the call to _crecv() returns -1 and the variable errno is set to the value
EQMSGLONG.

There is a standard error message for each value of errno, which you can print out by calling
nx_perror(). nx_perror() prints its argument (any string), the current node number and process
type, and the error message associated with the current value of errno to the standard error output in
the following format:

(node n, ptype p) string: error_message
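
To make the layout of this message concrete, the following sketch builds a string in the same format with standard C library calls. It is only an imitation for illustration, not the implementation of nx_perror(): the function format_nx_error() is hypothetical, the node number and process type are passed in as ordinary arguments, and the message text comes from the standard strerror(), which is not assumed to know Paragon-specific values such as EQMSGLONG.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of nx.h.
 * Writes "(node n, ptype p) string: error_message" into buf and returns
 * the value returned by snprintf(). The message text comes from the
 * standard strerror(), so it covers only standard errno values. */
int format_nx_error(char *buf, size_t size, long node, long ptype,
                    const char *string, int err)
{
    return snprintf(buf, size, "(node %ld, ptype %ld) %s: %s",
                    node, ptype, string, strerror(err));
}
```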

Suppose you have a program where the user can specify the size of a certain buffer with a 
command-line argument. If a message is received that is too long for this buffer, you would like to 
be able to tell the user what happened and suggest that they increase the buffer size. The following 
example uses the underscore version of crecv() to do this: 

#include <stdio.h>
#include <stdlib.h>
#include <nx.h>
#include <errno.h>

char *transbuf;
int transbuf_size;


if(_crecv(1, transbuf, transbuf_size) == -1) {
    if(errno == EQMSGLONG) {
        /* received message too long for buffer */
        printf("Message exceeded transit buffer size!\n");
        printf("Use -t to specify a larger transit buffer.\n");
        exit(1);
    } else {
        /* some other error, print a standard error
           message and exit */
        nx_perror("crecv");
        exit(1);
    }
}


Core Dumps

When certain severe errors occur in a standard OSF/1 system call or a Paragon OSF/1 system call 
whose name begins with nx_, the operating system sends a signal to the calling process. The default 
action for the following signals is to cause a core dump: 

SIGABRT (also called SIGIOT and SIGLOST)
Abort process (can be generated by the abort() system call).

SIGBUS Bus error (specification exception).

SIGEMT EMT instruction.

SIGFPE Floating point exception.

SIGILL Illegal instruction.

SIGQUIT Quit (can be generated by the user typing <Ctrl-\> on the terminal).

SIGSEGV Segmentation violation.

SIGSYS Bad argument to system call.

SIGTRAP Trace trap.

A core dump means that the process terminates immediately and writes a copy of its current memory
contents to a file called core in the current working directory. You can prevent this default action by
establishing a signal handler for the desired signal; see signal() in the OSF/1 Programmer’s
Reference for information on signal handlers.
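
As an illustration of establishing a handler, the following sketch uses the standard signal() call to replace the default action for SIGQUIT, so that the signal sets a flag instead of causing a core dump. The names on_quit(), install_quit_handler(), and quit_was_seen() are hypothetical. The same pattern applies to the other signals listed above, although returning normally from a handler for a hardware fault such as SIGSEGV or SIGFPE is generally not safe.

```c
#include <signal.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t quit_seen = 0;

/* Handler: record that SIGQUIT arrived instead of dumping core. */
static void on_quit(int sig)
{
    (void)sig;
    quit_seen = 1;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int install_quit_handler(void)
{
    return (signal(SIGQUIT, on_quit) == SIG_ERR) ? -1 : 0;
}

/* Returns 1 if SIGQUIT has been received since the program started. */
int quit_was_seen(void)
{
    return quit_seen != 0;
}
```

A program would call install_quit_handler() once during startup and test quit_was_seen() at convenient points in its main loop.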

NOTE 

No tools are currently provided for analyzing core files or 
debugging with core files.


If one process in an application dumps core, it may or may not terminate the rest of the application,
as follows:

• If the application was linked with the -nx switch, when one process of the application is 
terminated by the signal SIGBUS, SIGFPE, SIGILL, SIGSEGV, or SIGSYS the whole 
application is terminated. 

• If the application was not linked with -nx but does call nx_waitall(), if nx_waitall() detects that
one of the processes being waited for has been terminated by the signal SIGBUS, SIGFPE,
SIGILL, SIGSEGV, or SIGSYS, then nx_waitall() terminates the whole application by
sending a SIGKILL to the process group.

• In any other case (that is, if a process dumps core because of SIGABRT, SIGEMT, SIGQUIT, 
or SIGTRAP, the application was not linked with -nx, or the process that dumped core is not 
being waited for by nx_waitall()), the other processes in the application are not directly affected 
by one process dumping core. However, if all the processes in the application are running the 
same code, all processes may dump core independently. If several processes in an application 
dump core, they will all write their core dumps to the same core file unless the processes have 
changed to different working directories.


Controlling Floating-Point Behavior

Synopsis                Description

isnan(dsrc)             Determine if a double value is Not-a-Number (C only).
isnand(dsrc)            Determine if a double value is Not-a-Number (C only).
isnanf(fsrc)            Determine if a float value is Not-a-Number (C only).
fpgetround()            Get the floating-point rounding mode for the calling process (C only).
fpsetround(rnd_dir)     Set the floating-point rounding mode for the calling process (C only).
fpgetmask()             Get the floating-point exception mask for the calling process (C only).
fpsetmask(mask)         Set the floating-point exception mask for the calling process.
fpgetsticky()           Get the floating-point exception sticky flags for the calling process (C only).
fpsetsticky(sticky)     Set the floating-point exception sticky flags for the calling process (C only).

Paragon OSF/1 supports a series of floating-point control calls compatible with those of UNIX 
System V. 


NOTE 

Only fpsetmask() is available to Fortran programs. The other
floating-point control calls are available only to C programs.


Detecting Not-a-Number

The calls isnan(), isnand(), and isnanf() are used to determine whether a floating-point value is an
IEEE NaN, or “Not-a-Number.” This value can be generated as a result of certain floating-point
mathematical operations and system calls, when the operands are invalid or out of range. isnan() and
isnand() take an argument of type double, and isnanf() takes an argument of type float. (isnan()
and isnand() are identical except for the name.) All three calls return 1 if the argument is a NaN, and
0 otherwise.


NOTE 

These calls never generate an exception, even if the argument is 
a NaN. 
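
A typical use of these calls is to filter NaN values out of a result array before using it further. The following sketch does this with isnan(). It is written against the isnan() of the standard C header math.h and assumes that, for a double argument, it answers the same question as the isnan() described above (nonzero for a NaN); the function scrub_nans() is hypothetical.

```c
#include <math.h>

/* Hypothetical helper: replaces every NaN in v[0..n-1] with fallback
 * and returns the number of entries that were replaced. */
int scrub_nans(double *v, int n, double fallback)
{
    int i, count = 0;

    for (i = 0; i < n; i++) {
        if (isnan(v[i])) {    /* nonzero when v[i] is a NaN */
            v[i] = fallback;
            count++;
        }
    }
    return count;
}
```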


Controlling Floating-Point Behavior 

The calls fpgetround(), fpsetround(), fpgetmask(), fpsetmask(), fpgetsticky(), and fpsetsticky()
get and set the i860 microprocessor’s floating-point control registers. The values of these registers
are part of the process, and are saved and restored when the process is swapped in and out.

The get calls simply return the current value of the specified register for the calling process; the set
calls set the register to the specified value for the calling process and return its previous value.


Rounding Mode

fpgetround() and fpsetround() get and set the i860 microprocessor’s floating-point rounding mode,
which determines what happens when a floating-point value generated in a calculation cannot be
represented exactly.

The i860 microprocessor has four rounding modes:

FP_RN Round to nearest representable number (if two representable numbers are
equidistant, round to the even one).

FP_RM Round toward minus infinity.

FP_RP Round toward plus infinity.

FP_RZ Round toward zero (truncate).

These symbolic names are the values of the enum type fp_rnd, which is declared in <ieeefp.h>.
The argument of fpsetround() and the return values of fpsetround() and fpgetround() are of this
type.


NOTE

When you convert a floating-point value to an integer type in C, it
always truncates. The processor’s rounding mode is ignored.
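
The following sketch illustrates the difference between the truncation performed by a C conversion and rounding done explicitly in the program. Both function names are hypothetical. Note that round_half_away() rounds ties away from zero, which is a simple hand-written rule and not the same as the round-to-even behavior of FP_RN; both functions also assume that the value fits in a long.

```c
/* A C conversion discards the fractional part, whatever the
 * processor's rounding mode is. */
long truncate_to_long(double x)
{
    return (long)x;
}

/* Explicit rounding to the nearest integer, ties away from zero:
 * shift by one half toward the sign of x, then truncate. */
long round_half_away(double x)
{
    return (long)(x < 0.0 ? x - 0.5 : x + 0.5);
}
```

For example, truncate_to_long(2.7) yields 2 and truncate_to_long(-2.7) yields -2, while round_half_away() yields 3 and -3 for the same inputs.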


Exception Mask and Sticky Flags 

fpgetsticky() and fpsetsticky() get and set the i860 microprocessor’s floating-point exception sticky
flags, and fpgetmask() and fpsetmask() get and set the floating-point exception mask.

The i860 microprocessor defines five floating-point exceptions:

FP_X_INV Invalid operation exception.

FP_X_DZ Divide-by-zero exception.

FP_X_OFL Overflow exception.

FP_X_UFL Underflow exception.

FP_X_IMP Imprecise (loss of precision) exception.

These symbolic names are the values of the enum type fp_except, which is declared in <ieeefp.h>.
The arguments of fpsetsticky() and fpsetmask() and the return values of fpgetsticky(),
fpsetsticky(), fpgetmask(), and fpsetmask() are of this type.

The i860 microprocessor has five exception sticky flags and five exception mask bits corresponding 
to the five exception types. When a floating-point exception occurs, the corresponding exception 
sticky flag is set to 1. The corresponding exception mask bit is then examined; if it is set to 1, the 
exception is trapped and the appropriate trap handler is called. 


NOTE 

After an exception, you must clear the corresponding sticky flag to 
recover from the trap and proceed. 


If the sticky flag is not cleared before the next floating-point exception occurs, an incorrect exception
type may be signaled. For the same reason, when you call fpsetmask(), you must be sure that the
sticky flag corresponding to each exception being enabled is cleared.


NOTE

fpsetsticky() and fpsetmask() set the sticky flags and exception
mask to the specified values. Any bits not set in the call’s
argument are cleared.


To set or clear a particular mask or sticky flag, get the current mask or sticky flags, modify it, and
then call fpsetsticky() or fpsetmask() with the modified mask or sticky flags.
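
The get, modify, and set sequence can be sketched as follows. This is an illustration under stated assumptions: sim_getmask() and sim_setmask() are hypothetical stand-ins that simulate the mask in an ordinary variable so the example can run anywhere, and the bit values are the numeric exception values listed for Fortran in this chapter. In a real program you would call fpgetmask() and fpsetmask() with the symbolic names from <ieeefp.h> instead.

```c
#include <assert.h>

/* Illustrative bit values (the Fortran numeric values from this chapter). */
enum { X_INV = 1, X_DZ = 2, X_OFL = 4, X_UFL = 8, X_IMP = 16, X_ALL = 31 };

/* Hypothetical stand-in for the processor's exception mask. */
static int sim_mask = 0;

int sim_getmask(void)
{
    return sim_mask;
}

/* Like the real call, this replaces the whole mask: bits not set in the
   argument are cleared. */
void sim_setmask(int mask)
{
    assert((mask & ~X_ALL) == 0);
    sim_mask = mask;
}

/* Enable one exception without disturbing the others. */
void enable_exception(int which)
{
    sim_setmask(sim_getmask() | which);
}

/* Disable one exception without disturbing the others. */
void disable_exception(int which)
{
    sim_setmask(sim_getmask() & ~which);
}
```

Calling the set routine directly with a single value would silently disable every other exception, which is why the helpers read the current mask first.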


Fortran Exception Mask Values 

Only the fpsetmask() call is supported in Fortran. You use the following numeric values with
fpsetmask():

0 No exceptions. 

1 Invalid operation exception. 

2 Divide-by-zero exception. 

4 Overflow exception. 

8 Underflow exception. 

16 Imprecise (loss of precision) exception.

The argument and return value of fpsetmask() are integers whose values are the sum of some, none,
or all of these values.


Miscellaneous Calls


Synopsis        Description

flick()         Temporarily relinquish the CPU to another process.

dclock()        Return time in seconds since booting the system.


Temporarily Releasing Control of the Processor 

The flick() call temporarily releases control of the node processor to another process in the same
application. If there are no other processes in the same application when a process calls flick(),
control returns to the Paragon OSF/1 operating system. For example, if your node program has set
up a number of hrecv() calls and has nothing else to do, it should issue flick(). The operating system
can then more efficiently respond to an incoming message and wake up your process.

flick() does not have any effect on rollin and rollout of the application (see “Gang Scheduling” on
page 2-35 for information on rollin and rollout).


Timing Execution 

dclock() returns the time in seconds since the system was last booted, as a double precision number. 
This time is obtained from the RPM global clock and is the same on every node. 

Use dclock() to return a relative value that you can use to measure execution time. To time an 
interval in your program, first obtain an initial value. Then obtain a final value and take the 
difference. The actual values returned by the two dclock() calls are not important. 

Here is an example that shows how to use dclock() to time the execution of an iteration loop: 

/* C version */

double start_time, end_time, diff_time;

start_time = dclock();                /* initial clock value */
for (i = 0; i < imax; i++) {
    /* ... loop body being timed ... */
}
end_time = dclock();                  /* final clock value */
diff_time = end_time - start_time;    /* elapsed time in seconds */
printf("Timing = %e\n", diff_time);

c Fortran version

      double precision start_time, end_time, diff_time
      start_time = dclock()
      do 100 i = 1, imax
c        ... loop body being timed ...
  100 continue
      end_time = dclock()
      diff_time = end_time - start_time
      write(*, 10) diff_time
   10 format('diff_time = ', D15.9)


iPSC® and Touchstone DELTA Compatibility Calls


Synopsis                                Description

flushmsg(typesel, nodesel, ptypesel)    Flush specified messages from the system.

ginv(j)                                 Return the position of an element in the
                                        binary-reflected gray code sequence. Inverse of
                                        gray().

gray(j)                                 Return the binary-reflected gray code for an
                                        integer.

hwclock(hwtime)                         Place the current value of the hardware counter
                                        into a 64-bit unsigned integer variable.

infopid()                               Return the process type of the process that sent a
                                        pending or received message.

killcube(node, ptype)                   Terminate and clear node process(es).

killproc(node, ptype)                   Terminate a node process.

led(state)                              Does nothing; provided for compatibility only.

load(filename, node, ptype)             Load a node process.

mclock()                                Return the time in milliseconds.

msgcancel(mid)                          Cancel an asynchronous send or receive
                                        operation.

mypart(rows, cols)                      Obtain the height and width of the rectangle of
                                        nodes allocated to the current application.

mypid()                                 Return the process type of the calling process.

nodedim()                               Return the dimension of the current application
                                        (the number of nodes allocated to the application
                                        is 2^dimension).

restrictvol(fileID, nvol, vollist)      Does nothing; provided for compatibility only.


The calls flushmsg(), ginv(), gray(), hwclock(), infopid(), killcube(), killproc(), led(), load(),
mclock(), msgcancel(), mypart(), mypid(), nodedim(), and restrictvol() are provided for
compatibility with the iPSC series of supercomputers and Touchstone DELTA system from Intel
Corporation.

These calls should not be used in new Paragon OSF/1 programs. They either provide the same
functionality as Paragon OSF/1 calls (for example, mypid() is identical to myptype() but uses the
iPSC system terminology), or provide functionality that is not needed in Paragon OSF/1 (for
example, gray() is not useful in a machine without a hypercube architecture).
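
For reference, the binary-reflected gray code that the gray and ginv calls are described as computing can be sketched in a few lines of C. This is the standard textbook construction, not the library source; the names gray_sketch() and ginv_sketch() are hypothetical, chosen so they do not clash with the library calls.

```c
#include <assert.h>

/* Binary-reflected gray code of i: adjacent integers differ in one bit. */
unsigned long gray_sketch(unsigned long i)
{
    return i ^ (i >> 1);
}

/* Inverse: position of gray code g in the sequence. */
unsigned long ginv_sketch(unsigned long g)
{
    unsigned long i = 0;

    /* Fold in every right shift of g; this undoes the XOR above. */
    for (; g != 0; g >>= 1)
        i ^= g;
    return i;
}
```

On a hypercube, nodes whose numbers are consecutive gray codes are physical neighbors, which is why the mapping mattered on the iPSC system and is not needed on a mesh.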

These calls work the same as the corresponding calls on the iPSC or Touchstone DELTA system, 
with the following exceptions: 

• flushmsg() does nothing.

• The only valid use of killcube() is killcube(-1,-1).

• The only valid use of killproc() is killproc(-1,-1).

• led() does nothing.

• load() must be preceded by nx_initve() (it is equivalent to nx_load() but does not let you
specify a list of nodes or find out the PIDs of the loaded processes).

• msgcancel() does nothing.

• If numnodes() is not a power of 2, nodedim() rounds it up to the next power of 2 and returns
the dimension of a cube of that size. For example, if numnodes() is 7, nodedim() returns 3; if
numnodes() is 9, nodedim() returns 4.

• restrictvol() does nothing. It always returns 0 (indicating success).
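
The rounding rule for nodedim described in the list above can be sketched as follows. This is a minimal illustration of the rule as stated, not the library source; nodedim_sketch() is a hypothetical name, and the node count is passed in as an argument rather than obtained from the system.

```c
#include <assert.h>

/* Smallest dimension d such that 2^d is at least numnodes. */
int nodedim_sketch(long numnodes)
{
    int dim = 0;
    long size = 1;          /* number of nodes in a cube of dimension dim */

    assert(numnodes >= 1);
    while (size < numnodes) {
        size <<= 1;         /* next power of 2 */
        dim++;
    }
    return dim;
}
```

With 7 nodes the loop stops at a cube of 8 nodes (dimension 3); with 9 nodes it stops at 16 nodes (dimension 4), matching the examples in the text.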

See your iPSC or Touchstone DELTA system documentation for more information on these calls.


Introduction

The Paragon™ OSF/1 operating system provides two forms of parallel I/O to files: 

• A special file system type called PFS, for Parallel File System, gives applications high-speed 
access to a large amount of disk storage. PFS file systems are optimized for simultaneous access 
by multiple nodes. Files in PFS file systems can be very large (up to several terabytes); the exact 
maximum depends on your system configuration. Access to PFS file systems also uses an I/O 
technique called fast path I/O, which gives superior performance for large I/O operations (64K
bytes or more per read or write). 

• Special I/O system calls, called parallel I/O calls, facilitate I/O from multiple nodes and permit
I/O to very large files in PFS file systems. These calls can give applications better performance 
and more control over parallel file I/O than is offered by the standard C and Fortran file I/O 
features. These calls are compatible with the Concurrent File System™ (CFS™) calls provided 
by the iPSC® system. 

A system running Paragon OSF/1 can have both PFS and non-PFS file systems. You can access files 
in PFS file systems with both parallel I/O calls and non-parallel I/O calls; you can use parallel I/O 
calls to access files in both PFS file systems and non-PFS file systems. However, in most cases you 
get the best performance when you use parallel I/O calls to access files in PFS file systems. 

This chapter discusses both PFS file systems and parallel I/O calls. It also gives information on 
performing operations on tape devices in Paragon OSF/1. For information on getting the best 
performance from PFS file systems and parallel I/O calls, see “I/O Performance” on page 8-23.


Disks and File Systems

Every Intel supercomputer has one or more disk devices attached to it. Each disk device is either a 
single hard disk or a RAID subsystem. RAID stands for Redundant Array of Inexpensive Disks; in a 
RAID subsystem, several hard disks are connected together into a unit that appears to the system as 
a single large disk drive. Files stored to a RAID subsystem are distributed, or striped, among the 
disks within it by the RAID controller hardware. 

Each disk device is controlled by an I/O node: a compute node with an I/O connection. I/O nodes 
communicate with the other nodes in the system using the node-to-node message-passing network 
and with the disk drives using a SCSI interface (or other interface). The I/O nodes may or may not 
also run application processes; this is determined by your system administrator. Each I/O node can 
control up to seven disk devices, and the number of I/O nodes is limited only by the number of slots 
in the system, so the total amount of disk space that could be installed in an Intel supercomputer is 
a terabyte or more. 

The set of disk devices connected to the Intel supercomputer’s I/O nodes is divided into file systems. 
A file system can encompass anything from a portion of the space on one disk device to all of the 
space on several disk devices. A file system is made accessible by mounting it to a directory (this 
requires system administrator privileges). This directory is called the file system’s mount point. For 
example, if the file system /dev/io0/rz0f is mounted on the directory /home (the directory /home is
the file system’s mount point), whenever you write a file in /home it is stored in the file system
/dev/io0/rz0f.

Each file system has a type that describes its internal structure and determines some of the operations 
that can be performed on it. The supported file system types are: 

UFS         UNIX File System, the standard file system type for OSF/1.

NFS         Network File System, a file system type that represents a file system on
            another computer on the network.

PFS         Parallel File System, a file system type that is optimized for access by parallel
            processes. This file system type is unique to Paragon OSF/1.

This chapter discusses how PFS file systems work and how you can use the parallel I/O system calls 
provided by Paragon OSF/1 to access files in file systems of all types.


PFS File Systems and PFS Files

Internally, a file system of type PFS consists of one or more stripe directories. The stripe directories 
that make up a PFS file system are determined by the system administrator when the PFS file system 
is mounted. 

Each stripe directory is usually the mount point of a separate UFS file system. Just as a RAID 
subsystem collects together several hard disks into a unit that behaves like a single large disk, a PFS 
file system collects together several file systems into a unit that behaves like a single large file 
system. A system running Paragon OSF/1 can have any number of PFS file systems. 

The maximum storage capacity of a PFS file system is the sum of the capacities of the different file
systems containing its stripe directories. For example, if a PFS file system consists of four stripe 
directories, each of which is the mount point of a UFS file system with a capacity of 100M bytes, 
the capacity of the PFS file system is 400M bytes. However, if another PFS file system also consists 
of four stripe directories, but two of them are directories in one UFS file system with a capacity of 
100M bytes and the other two are directories in another UFS file system with a capacity of 100M 
bytes, the capacity of the PFS file system is only 200M bytes. 

A PFS file is any ordinary file that is stored in a file system of type PFS. PFS files are distributed, 
or striped, across the stripe directories that make up the PFS file system. The amount of data from a
PFS file that is stored in each stripe directory is determined by the PFS file system’s stripe unit, a 
quantity that is set by the system administrator when the PFS file system is mounted. The maximum 
size of a file in a PFS file system is roughly 2G bytes times the number of file systems in the PFS 
file system. 1 
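
The exact maximum file size given in the footnote to this section can be worked through in C. The sketch below simply evaluates that formula with 64-bit arithmetic; the function name pfs_max_file_size() and its parameter names are hypothetical.

```c
#include <assert.h>

/* Evaluate ((((2G - 1) - r) x n) + r), where r = (2G - 1) mod stripe_unit
   and n is the number of different file systems holding stripe directories. */
long long pfs_max_file_size(long long stripe_unit, int n)
{
    const long long limit = 2147483647LL;   /* 2G - 1 bytes */
    long long r;

    assert(stripe_unit > 0 && n > 0);
    r = limit % stripe_unit;                /* partial stripe unit at the end */
    return (limit - r) * n + r;
}
```

For example, with a stripe unit of 8192 bytes and four file systems, r is 8191 and the result is 8,589,910,015 bytes, which is slightly less than 4 x 2G.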

For example, suppose a PFS file system consists of four stripe directories and has a stripe unit of 4K 
bytes. When you write a 20K-byte file to this PFS file system, the first 4K bytes of the file are stored 
in the first stripe directory, the second 4K bytes in the second stripe directory, the third 4K bytes in 
the third stripe directory, the fourth 4K bytes in the fourth stripe directory, and the last 4K bytes back 
in the first stripe directory. 
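
The round-robin layout in this example can be expressed as a small calculation. The following is a sketch of the mapping as described here, not the PFS implementation; stripe_dir_for_offset() is a hypothetical helper, and stripe directories are numbered from 0.

```c
#include <assert.h>

/* Index (from 0) of the stripe directory holding the byte at offset. */
int stripe_dir_for_offset(long long offset, long long stripe_unit, int sfactor)
{
    assert(offset >= 0 && stripe_unit > 0 && sfactor > 0);
    /* Count whole stripe units before the offset, then wrap around. */
    return (int)((offset / stripe_unit) % sfactor);
}
```

With a 4K-byte stripe unit and four stripe directories, offsets 0 through 4095 map to directory 0, and offset 16384 (the start of the last 4K bytes of the 20K-byte file) wraps back to directory 0.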

Objects in PFS file systems that are not ordinary files (objects such as directories, symbolic links, 
and device special files) are not striped; each such object exists on just one disk. 


PFS Filenames and Pathnames 

Filenames and pathnames in PFS file systems work the same as pathnames in UFS file systems. The 
maximum length of a pathname is 1024 characters; the maximum length of a single filename is 255 
characters. 


1. The exact maximum size is given by the formula ((((2G - 1) - r) x n) + r), where r is
(2G - 1) mod stripe_unit (that is, the remainder when the largest integer multiple of the stripe unit that is less
than 2G - 1 is subtracted from 2G - 1) and n is the number of different file systems containing the PFS file
system’s stripe directories.


PFS Limitations

In the current release, PFS file systems and parallel I/O calls have the following limitations: 

• PFS files cannot be accessed from a remote system via NFS. 

• PFS does not support executable files. If you copy a binary file to a PFS file system and try to 
execute it, an “Operation not supported by this file system” error occurs. 

• PFS does not support core files. If a core dump occurs while your current directory is in a PFS 
file system, a core file of length 0 is created. 

• PFS does not support the quotaon or sysacct commands or the mmap() system call.

• PFS file regions cannot be locked by the fcntl() system call. However, you can use the flock() 
system call to lock the entire file. 

• The maximum number of open files per process at any given time is 64. This includes the 
standard input, standard output, and standard error. This means that there is a practical 
maximum of 61 open files per process.


Using PFS Commands

In general, you use standard OSF/1 commands such as ls, cat, cp, and mv to manipulate files in PFS
file systems. See the OSF/1 Command Reference for information on these commands. (Many
commands do not work with files larger than 2G - 1 bytes, as described under “Using Extended 
Files” on page 5-33.) This section describes the additional file and file system commands provided 
by Paragon OSF/1. 


Displaying File System Attributes 


Command Synopsis                                         Description

showfs [ -k ] [ -t type ] [ filesystem | directory ]     Display file system attributes.


The command showfs with no arguments lists the file systems on your system, together with 
information on each. For example: 


% showfs
Mounted on             512-blks     avail  capacity   sunit  sfactor
/                       1458308    719276       45%
/home                   4060838   3373782        8%
/usr                    2379194   1948124        9%
/home/.sdirs/vol0        598622    574464        4%
/home/.sdirs/vol1        598622    574464        4%
/home/.sdirs/vol2        598622    574464        4%
/home/.sdirs/vol3        598622    574464        4%
/pfs                    2394488   2297856        4%    8192        4
        sdirectories:
                /home/.sdirs/vol0
                /home/.sdirs/vol1
                /home/.sdirs/vol2
                /home/.sdirs/vol3




In this case, the system has eight file systems. The seven file systems mounted on the directories
/ (root), /home, /usr, /home/.sdirs/vol0, /home/.sdirs/vol1, /home/.sdirs/vol2, and /home/.sdirs/vol3
are non-parallel file systems (type UFS or NFS); the file system mounted on the directory /pfs is a
PFS file system.


NOTE 

There’s nothing special about the name /pfs; your PFS file
systems can have any name. However, the rest of this chapter
uses the convention that pathnames beginning with /pfs are in a
PFS file system.


The showfs command shows the following information for every file system:

Mounted on     The directory where the file system is mounted (its mount point). If you need
               to know the file system’s device name, use the standard OSF/1 command
               mount or df.

512-blks       The total capacity of the file system in 512-byte disk blocks.

avail          The number of disk blocks currently available.

capacity       The approximate percentage of the file system’s capacity currently in use.

In this example, the file system mounted on /usr has a size of 2,379,194 512-byte disk blocks, of
which 1,948,124 blocks are currently unused, so that the file system is approximately 9% full.

The showfs command shows the following additional information for each PFS file system: 

sunit          The file system’s stripe unit, in bytes.

sfactor        The number of stripe directories within the PFS file system.

sdirectories   The stripe directories (usually mount points of UFS file systems) within the
               PFS file system.

In this example, the PFS file system mounted on /pfs has a stripe unit of 8K bytes and consists of the
four UFS file systems mounted on /home/.sdirs/vol0, /home/.sdirs/vol1, /home/.sdirs/vol2, and
/home/.sdirs/vol3.

The showfs command accepts the following optional arguments: 

-k Display capacity and available capacity in 1024-byte disk blocks instead of 

512-byte disk blocks. The header “512-blks” changes to “kbytes”. 

-t type Display information about all file systems of type type, where type is any 

recognized file system type in lowercase (pfs, ufs, or nfs). 

filesystem Display information about the file system whose device name is filesystem. 

directory Display information about the file system mounted on directory. 

If you specify a filesystem or directory argument together with -t type, the argument overrides -t type.


NOTE

You should use showfs, not df, to get information about the 
cumulative amount of free space in a PFS file system. Using the 
standard df command on a PFS file system only gives information 
about the single disk partition on which the PFS file system is 
mounted, so does not indicate how much space is actually 
available for file striping. 


Increasing the Size of a File 


Command Synopsis Description 

lsize [ -a ] size file [file ... ] Change the size of a file or files. 


The lsize command changes the amount of disk space allocated to each specified file. You can use 
this command to allocate all the space you will need for a large file before you run the application 
that writes to the file. This makes sure that there is enough room in the file system for the file, and 
can also increase file I/O performance. 

The lsize command has two forms: 

lsize size file [ file ... ]        Sets the size of the file(s) to size bytes.

lsize -a size file [ file ... ]     Increases the size of the file(s) by size bytes.

If the specified file does not exist, it is created with the specified size. The size can be a simple integer 
to represent a number of bytes, or an integer followed by the letter k, m, or g to represent a number 
of kilobytes (1024 bytes), megabytes (1024K bytes), or gigabytes (1024M bytes). 

For example, the following command sets the size of the file mydat to 5M bytes: 

% lsize 5m mydat 

The following command increases the size of the file mydat by 200K bytes: 

% lsize -a 200k mydat 

The additional space is allocated to the file from the file system, but it is not initialized (its contents 
are undefined). 
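
The size notation accepted by lsize can be illustrated with a short parser. This is a sketch of the rules described above (a plain integer, or an integer followed by k, m, or g), not the command's actual source; parse_size() is a hypothetical helper that returns -1 for input it does not recognize.

```c
#include <assert.h>
#include <stdlib.h>

/* Convert a size string such as "5m" or "200k" to a number of bytes. */
long long parse_size(const char *s)
{
    char *end;
    long long n;

    assert(s != NULL);
    n = strtoll(s, &end, 10);
    if (end == s || n < 0)
        return -1;                          /* no digits, or negative */

    switch (*end) {
    case '\0': return n;                    /* plain byte count */
    case 'k':  n *= 1024LL; break;          /* kilobytes (1024 bytes) */
    case 'm':  n *= 1024LL * 1024; break;   /* megabytes (1024K bytes) */
    case 'g':  n *= 1024LL * 1024 * 1024; break;  /* gigabytes (1024M bytes) */
    default:   return -1;                   /* unknown suffix */
    }
    return (end[1] == '\0') ? n : -1;       /* reject trailing characters */
}
```

Under these rules, 5m is 5,242,880 bytes and 200k is 204,800 bytes.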

lsize will not decrease the size of a file. If the specified size is smaller than the file’s current size, the
command has no effect.


Using Parallel I/O Calls

The rest of this chapter discusses the parallel I/O calls you can use in parallel applications to access
both PFS and non-PFS files.

The term parallel I/O calls refers to all I/O calls that are provided by Paragon OSF/1 but not by
standard OSF/1. These calls facilitate I/O on multiple nodes and permit I/O to very large files in PFS
file systems. They are part of the library libnx.a, which is automatically searched when you link an
application with the -nx switch. You can also use the switch -lnx to search libnx.a without using -nx.
See “Compiling and Linking Applications” on page 2-5 for more information on these switches. 

Most of the parallel I/O calls can only be used in programs running in the compute partition. They 
will not work, or will give unexpected results, if used in a program running in the service partition. 
(See “Managing Partitions” on page 2-25 for more information on the service and compute 
partitions.) 


NOTE 

The parameter fileID in the system call synopses in this chapter is
an integer that represents an open file: a unit in Fortran, or a file 
descriptor in C. 


A call description at the beginning of each section or subsection gives a language-independent
synopsis (call name, parameter names, and brief description) of each call discussed in that section.
Differences between C and Fortran are noted where applicable. See Appendix A for information on
call and parameter types; see the Paragon™ C System Calls Reference Manual or the Paragon™
Fortran System Calls Reference Manual for complete information on each call.


Opening Files in Parallel


Synopsis                                       Description

gopen(path, oflag, iomode [, perms ]) (C)      Open a file on all nodes and set its I/O
gopen(unit, path, iomode) (Fortran)            mode.


To open a file for use by all the nodes in your application, call gopen(). You can use gopen() to open
files in both PFS and non-PFS file systems. gopen() works like the standard open() operation, with
the following exceptions:

• It is a global call. All the nodes in the application must call gopen(), and all must call it with the
same arguments.

• It is a synchronizing call. Each node blocks at the gopen() until all the nodes have called it.

• It sets the I/O mode of the file, as described under “Using I/O Modes” on page 5-13.

• When called on a large number of nodes, it offers better performance and causes less system 
overhead. 

Note that gopen() must be called by all the nodes in the application, even those that do not actually
perform any I/O. For example, suppose that your application has a “manager” node that assigns I/O
work to the “worker” nodes, but does no I/O itself. If you want to use gopen(), all the nodes, even
the manager, must open the file.


Using gopen() in C

The C version of gopen() opens the specified file and returns a file descriptor, like the standard
OSF/1 system call open(). In addition to being a global synchronizing call and setting the I/O mode
of the file, as discussed earlier, the C version of gopen() has the following differences from the
standard open():

• It can only be used to open an ordinary file (not a directory or a device special file).

• If an error occurs, it prints an error message and terminates the calling process.

gopen() is otherwise equivalent to open(). For example, the following C call opens the file
/pfs/mydat for reading and writing, creating it if it does not exist, and returns a file descriptor that
you can use to access it. The file’s I/O mode is set to M_GLOBAL, and if the file is created it is
given permissions 644 octal (rw-r--r--).

#include <fcntl.h>
#include <nx.h>

int fd;

fd = gopen("/pfs/mydat", O_RDWR | O_CREAT, M_GLOBAL, 0644);

The symbolic names for oflag (such as O_CREAT) are defined in the header file fcntl.h, and the
symbolic names for iomode (such as M_GLOBAL) are defined in the header file nx.h.

See open() in the OSF/1 Programmer’s Reference for information on the oflag parameter; see
“Using I/O Modes” on page 5-13 for information on the iomode parameter; see chmod() in the
OSF/1 Programmer’s Reference for information on the perms parameter.


Using gopen() in Fortran

The Fortran version of gopen() opens the specified file for unformatted I/O on a specified unit. It is
equivalent to the following Fortran open() statement:

      OPEN(unit, path, status='unknown', form='unformatted',
     x     access='sequential')

However, it differs from the standard Fortran open() in that it is a subroutine. Also, as discussed 
earlier, it is a global synchronizing call and sets the I/O mode of the file. 

For example, the following Fortran call opens the file /pfs/mydat on unit 10 in I/O mode
M_GLOBAL:

      include 'fnx.h'

      call gopen(10, "/pfs/mydat", M_GLOBAL)

The symbolic names for iomode (such as M_GLOBAL) are defined in the header file fnx.h.


Opening Files with Standard Operations

PFS and non-PFS files can also be opened and closed with the standard OSF/1 system calls and
Fortran routines. For example, to open the file /pfs/mydat for read and write access:

/* C version */

fd = open("/pfs/mydat", O_CREAT | O_RDWR, 0644);

c Fortran version

      open(unit=10, file='/pfs/mydat',
     x     status='new', form='unformatted')

Use this method when not all nodes open the same file at the same time, or when source 
compatibility with other systems is necessary. (Note that, if you want to use any synchronizing calls, 
all nodes must open the file.) 


NOTE 

In Fortran, you must open the file with form='unformatted' in
order to use any parallel I/O calls on the file.


The following section discusses additional special considerations for Fortran. 


Special Considerations for Fortran 

This section describes the special considerations that apply when you open files with the standard 
Fortran open() instead of gopen(). 


Formatted Versus Unformatted I/O 

If you call open() with form='formatted' (the default):

• You must use only Fortran I/O statements to access the file. You cannot use any of the parallel 
I/O calls described in this chapter on the file. 

• Only one node may perform I/O to the file. If you perform formatted I/O to the same file from 
multiple nodes, the results are undefined. 

If you open a file with form='unformatted', you can use either Fortran I/O statements or parallel
I/O calls to access the file. However, you must pick either one or the other: mixing Fortran I/O and
parallel I/O to the same file can give unexpected results.

For the best I/O performance, you should use gopen(), or open() with form='unformatted', and use
parallel I/O calls for all file I/O.

If compatibility with other programs that use formatted I/O is required, you can perform formatted
I/O to an internal file or a string and then use cwrite() to write the data to a file. However, if you use
a string you must add a newline (ASCII character 10) to the end of the string using the function
char(), since neither formatted I/O to a string nor cwrite() will add one for you. For example:

      include 'fnx.h'
      character*20 msgbuffer

      write(msgbuffer, 26) answer, char(10)
   26 format('The answer is: ', i4, a1)
      call cwrite(10, msgbuffer, 20)

Alternatively, you can write a small program that translates your data files from unformatted to 
formatted and vice versa, and run it only when you need to share data with other programs. 


New Files 

If you call open() with status='new', the result depends on whether or not the program is running
on multiple nodes:

• If the program is running on one node (numnodes() is 1 or undefined), the open() fails if the
file exists, as specified by the ANSI standard.

• If the program is running on multiple nodes (numnodes() is greater than 1), the file is truncated
if it exists, as though you had specified status='unknown'.

This change makes it possible to specify status='new' when multiple nodes are opening a file that
does not yet exist; with the standard Fortran semantics for status='new', the first node to execute
the open() statement would create the file, and the other nodes would fail because the file already
exists. You can use the system call stat() to determine if a file exists before you open it.


Unnamed Files

If you call open() with no filename, the result depends on whether or not you specified
status='scratch':

• If you did not specify status='scratch', the file is created in the current working directory with
the filename fort.nnn, where nnn is the unit number. The file remains after the program
terminates.

• If you specified status='scratch', the file is created in the directory /usr/tmp with the filename
FTNxxxxxxxx.nn, where xxxxxxxx is the OSF/1 process ID of the creating process and nn is the
unit number. The file does not remain after the program terminates, whether it terminated
normally or abnormally.

For compatibility with the iPSC system, if you specified status='scratch' and the directory specified
by the variable CFS_MOUNT exists (or, if CFS_MOUNT is not defined, if the directory /cfs exists),
the file FTNxxxxxxxx.nn is created in $CFS_MOUNT (or /cfs) instead of /usr/tmp.


Using I/O Modes 


Synopsis                       Description

setiomode(fileID, iomode)      Set the I/O mode for a file.

iomode(fileID)                 Return the current I/O mode for a file.


A parallel application can access a file in one of five I/O modes. You can specify a file’s I/O mode
when you open it with gopen(), and you can use setiomode() to change the I/O mode of a file that
is already open. You can use iomode() to determine an open file’s current mode.

Like gopen(), setiomode() is a global synchronizing call. When a node calls setiomode(), it blocks
until all the other nodes in the application call setiomode() with the same arguments. setiomode()
must be called by all the nodes in the application, even those that do not actually perform any I/O
(this means that all nodes must open the file). Also, setiomode() can only be used on an ordinary
file, not a directory or a device special file.

A file’s I/O mode actually belongs to the file descriptor or unit through which the file is accessed, 
not to the file itself. The I/O mode is not stored with the file, and different programs can access the 
same file with different I/O modes (even at the same time). 

A file’s I/O mode is not inherited across a fork() (after a fork() all files in the child process have I/O
mode M_UNIX).

There are five I/O modes, each of which has a name and a number:


M_UNIX (0)      In this mode, each node has its own file pointer and file operations are
                performed on a first-come, first-served basis. If you open a file with the C
                open() call or Fortran open statement, it is opened with this mode (but you
                can change it with setiomode()).

M_LOG (1)       In this mode, all nodes share the same file pointer and file operations are
                performed on a first-come, first-served basis.

M_SYNC (2)      In this mode, all nodes share the same file pointer and file operations are
                performed in order by node number. Records may be of variable length.

M_RECORD (3)    In this mode, each node has its own file pointer and file operations are
                performed on a first-come, first-served basis. However, records are stored in
                the file in order by node number. Records must be of a fixed length.

M_GLOBAL (4)    In this mode, all nodes share the same file pointer and must perform the same
                file operations at the same time. All file operations are performed by a single
                node, which then distributes the results to the other nodes over the internode
                network.

The names M_UNIX, M_LOG, M_SYNC, M_RECORD, and M_GLOBAL are constants
defined in the header files nx.h (for C) and fnx.h (for Fortran). You can use either these names or the
corresponding numbers in your programs (using the names is recommended).
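
The properties of the five modes listed above can be summarized in code. The sketch below is only an illustration: it defines its own stand-in constants (prefixed SK_) with the mode numbers given in this section so that it compiles without nx.h, and the helper functions shares_file_pointer() and first_come_first_served() are hypothetical, not library calls.

```c
#include <assert.h>

/* Stand-ins for the constants in nx.h, using the numbers listed above. */
enum iomode_sketch {
    SK_M_UNIX = 0, SK_M_LOG = 1, SK_M_SYNC = 2, SK_M_RECORD = 3, SK_M_GLOBAL = 4
};

/* Nonzero if all nodes share one file pointer in this mode. */
int shares_file_pointer(int mode)
{
    assert(mode >= SK_M_UNIX && mode <= SK_M_GLOBAL);
    return mode == SK_M_LOG || mode == SK_M_SYNC || mode == SK_M_GLOBAL;
}

/* Nonzero if operations are honored first-come, first-served in this mode. */
int first_come_first_served(int mode)
{
    assert(mode >= SK_M_UNIX && mode <= SK_M_GLOBAL);
    return mode == SK_M_UNIX || mode == SK_M_LOG || mode == SK_M_RECORD;
}
```

A table like this can help when choosing a mode: for instance, M_RECORD is the only mode that combines private file pointers with a node-ordered file layout.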

The I/O mode you choose for a file determines which, if any, parallel I/O calls become synchronizing 
operations (that is, each node blocks until all nodes have made the call). The synchronizing 
operations for each mode are described in the following sections and are summarized under 
“Synchronization Summary” on page 5-48. 


M_UNIX (Mode 0)

In mode M_UNIX (number 0), each node maintains its own file pointer. File access requests are 
honored on a first-come, first-served basis. If two nodes write to the same place in the file, the second 
node overwrites the data written by the first node. This mode is the default for files opened with the 
standard open call or statement. 

Use this mode in applications where each node performs I/O on disjoint segments of the file, or 
where I/O accesses are synchronized by other means (such as message-passing inherent to the 
application).


M_LOG (Mode 1)

In mode M_LOG (number 1), all nodes share a single file pointer. File accesses are performed on a
first-come, first-served basis. Whenever any node reads, writes, or moves the pointer, it affects the 
pointer position for all nodes. This may change the results of subsequent reads, writes, or moves by 
other nodes. This mode is useful for parallel log files. 

Closing a file in this mode is a synchronizing operation. When a node closes a file, the operation
blocks until all the other nodes also close the file. 


M_SYNC (Mode 2) 

In mode M_SYNC (number 2), all nodes share a single file pointer and the nodes access the file in 
a synchronized round-robin fashion. 

• All nodes share a single file pointer, as for M_LOG. 

• All the nodes in the application must open the file, and all must perform the same operations on 
the file in the same order. Reads and writes can be of variable sizes. 

• All file operations are synchronizing. 

Closing a file is a synchronizing operation, as for M_LOG. In addition, reading, writing,
seeking (using lseek()), and detecting end-of-file (using iseof()) become synchronizing
operations: they block until all nodes have called them. For example, when a node reads from
a file with the parallel I/O call cread(), the node blocks and the read request is not honored until
all other nodes have called cread().

• All reads and writes to the file are performed in order by node number. 

For example, suppose node 3 in an application running on four nodes writes to a file with the
parallel I/O call cwrite() before any of the other nodes. Node 3 blocks until all the other nodes
have called cwrite(). When all nodes have called cwrite(), the data from node 0 is written to the
file first, followed by the data from node 1, then the data from node 2, and finally the data from
node 3.

• The only valid use for lseek() is for all nodes to seek to the same position in the file. If nodes
attempt to seek to different positions, an error occurs.


M_RECORD (Mode 3)

Mode M_RECORD (number 3) gives results that are similar to M_SYNC, but it operates more
efficiently. However, M_RECORD requires a fixed record size. 

• Each node has its own file pointer, as for M_UNIX. 

• All the nodes in the application must open the file, and all must perform the same operations on 
the file in the same order, as for M_SYNC.

• Corresponding reads and writes must be of the same size on all nodes. 

When a node reads or writes to the file for the nth time, it must read or write the same number 
of bytes as the nth read or write by every other node. For example, if node 0 writes 100 bytes to
the file with its first call to cwrite() and 50 bytes with its second call to cwrite(), then all nodes
must write 100 bytes with their first call to cwrite() and 50 bytes with their second call to
cwrite().


NOTE 

No verification is performed. You must make sure that all the 
nodes in the application make the same calls and read and write 
the same number of bytes. 


If different nodes read different amounts of data, incorrect data will be read. If different nodes 
write different amounts of data, the output of different nodes will overwrite each other and/or 
leave areas of the file with uninitialized data. 

• All reads and writes appear to be performed in order by node number. 

Because reads and writes are of a known length, the operating system on each node can 
determine where in the file it should be reading from or writing to independently of the other 
nodes. The results of reading or writing a file with M_RECORD are the same as with M_SYNC,
but M_RECORD is more efficient because no synchronization is required. No seeking is 
required by the application; the file system automatically reads or writes file data to or from the 
proper offset in the file. 

For example, suppose node 2 in an application running on four nodes writes a 10-byte record. 
Node 2’s file pointer is first moved forward by 20 bytes to leave room for the records from nodes 
0 and 1. Next, node 2’s record is written to the file (which advances the file pointer by 10 bytes). 
Finally, node 2’s file pointer is moved forward by 10 bytes to leave room for node 3’s record. 
The other nodes can fill in their “slots” at any time (earlier or later); no synchronization or 
communication between nodes is required.

• Closing a file is a synchronizing operation, as for M_LOG and M_SYNC.

• As for M_SYNC, lseek() becomes a synchronizing call, and the only valid use for lseek() is for
all nodes to seek to the same position in the file. If nodes attempt to seek to different positions,
an error occurs.


M_GLOBAL (Mode 4)

In mode M_GLOBAL (number 4), all nodes must read and write the same data to the same parts of
the file at the same time. This mode gives excellent performance for programs that work this way,
such as a program where every node reads in the entire contents of a large input file. 

• All nodes share a single file pointer. 

• All the nodes in the application must open the file, and all must perform the same operations on 
the file at the same time. 

• All file operations are synchronizing. 

• Corresponding reads and writes must be of the same size on all nodes. 

• The only valid use for lseek() is for all nodes to seek to the same position in the file.

• When the nodes write to a file, only the data written by a single node is actually written. Data 
written by other nodes is ignored. 

In the implementation of this mode, only a single node actually reads from and writes to
the disk. After a read, that node distributes the data to the other nodes over the internode network.
This eliminates the contention for the disk device that would otherwise occur when many nodes 
attempt to read from the same place in a file at the same time. 


An I/O Mode Example 

This section provides a small example program (in Fortran and C) that you can compile and execute 
to illustrate the differences between the various I/O modes. The source for this program can be found 
on the Intel supercomputer in /usr/share/examples/fortran/iomodes/iomodes.f (Fortran version) or
/usr/share/examples/c/iomodes/iomodes.c (C version).

The example program works as follows: node 0 gets an I/O mode from the user (specified as a
number), and sends it to the other nodes. Then all nodes call gopen() to open the file mydat in the
current directory (which could be in either a PFS file system or a non-PFS file system) with the
specified I/O mode.

Each node then writes 10 records to the file. Each record contains the time in seconds since the file
was opened, to four decimal places, and the message “Hello from node x.” Node 0 waits one second
before each write to the file; the other nodes write as fast as they can (this demonstrates how writes
to the file are synchronized differently in the different modes). When each node finishes writing, it
writes a “done writing” message to the screen. Then it closes the file and writes a “finished” message
to the screen (the two messages show that, in some modes, close() is a synchronizing operation).


Fortran Example 

      program iomodes
      include 'fnx.h'
      integer nunit, mode, iam
      double precision start, now, loop_time, loop_start
      character*16 msg
      character*29 msgbuffer

      msg = 'Hello from node '
      nunit = 12
      iam = mynode()

      if(iam .eq. 0) then
         print *, 'Enter I/O mode (0, 1, 2, 3, or 4):'
         read(*, 11) mode
 11      format(i1)
         call csend(1, mode, 4, -1, myptype())
      else
         call crecv(1, mode, 4)
      endif

      call gopen(nunit, 'mydat', mode)
      print 13, iam, iomode(nunit)
 13   format('Node ', i4, ' using mode ', i1)

      start = dclock()
      do 100 i = 1, 10

c     *** if node 0, do nothing for 1.0 seconds ***
         if(iam .eq. 0) then
            loop_start = dclock()
 101        loop_time = dclock() - loop_start
            if(loop_time .lt. 1.0) goto 101
         endif

c     *** all nodes now write a record to the file ***
 102     now = dclock() - start
         write(msgbuffer, 14) now, msg, iam, char(10)
 14      format(f7.4, a17, i4, a1)
         call cwrite(nunit, msgbuffer, 29)
 100  continue

      print 15, iam
 15   format('Node ', i3, ' done writing')
      close(nunit)
      print 16, iam
 16   format('Node ', i3, ' finished')
      end


C Example 

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <nx.h>

main()
{
    int i, fd;
    double start, now;
    double loop_start, loop_cur;
    long mode, iam;
    char instring[40], msg[40];

    iam = mynode();

    if(iam == 0) {
        printf("Enter I/O mode (0, 1, 2, 3, or 4):\n");
        fgets(instring, sizeof(instring), stdin);
        sscanf(instring, "%ld", &mode);
        csend(1, &mode, sizeof(mode), -1, myptype());
    } else {
        crecv(1, &mode, sizeof(mode));
    }

    fd = gopen("mydat", O_WRONLY | O_CREAT | O_TRUNC, mode, 0666);
    printf("Node %ld using mode %ld\n", iam, (long)iomode(fd));

    start = dclock();
    for(i=0; i<10; i++) {
        /* if node 0, do nothing for 1.0 seconds */
        if(iam == 0) {
            loop_start = dclock();
            loop_cur = loop_start;
            while(loop_cur - loop_start < 1.0) {
                loop_cur = dclock();
            }
        }

        /* all nodes now write a record to the file */
        now = dclock() - start;
        sprintf(msg, "%7.4f Hello from node %4ld\n", now, iam);
        cwrite(fd, msg, strlen(msg));
    }

    printf("Node %ld done writing\n", iam);
    close(fd);
    printf("Node %ld finished\n", iam);
}


Compiling and Running the Example 

To compile this program to a parallel application, use the following if77 or icc command: 

% if77 -nx iomodes.f -o iomodes
or 

% icc -nx iomodes.c -o iomodes 

When you run the resulting application, you may find the output easier to understand if you run the 
example on four or fewer nodes. Use the -sz switch to determine the number of nodes on which the 
application runs (see “Controlling the Application’s Execution Characteristics” on page 2-13 for 
information on -sz and other application switches).

For example, to run the application on two nodes of your default partition with I/O mode 1
(M_LOG):

% iomodes -sz 2
Enter I/O mode (0, 1, 2, 3, or 4):
1
Node 0 using mode 1
Node 1 using mode 1
Node 1 done writing
Node 0 done writing
Node 1 finished
Node 0 finished
%

The following example outputs came from the C version of the example, run on two nodes. 


M_UNIX Output 

In mode M_UNIX (0), each node has its own file pointer. Node 1 finishes right away. Node 0 waits
before each write and overwrites the messages from node 1. As a result, the file contains only the
writes from node 0.

 1.0000 Hello from node 0
 2.0087 Hello from node 0
   ...
 9.0711 Hello from node 0
10.0797 Hello from node 0


M_LOG Output

In mode M_LOG (1), the nodes share a common file pointer, but there is no synchronization. As in
mode M_UNIX, node 1 finishes right away; but this time, node 0 appends its data to the file rather
than overwriting the data from node 1.

 0.0000 Hello from node 1
 0.0382 Hello from node 1
   ...
 0.0990 Hello from node 1
 0.1076 Hello from node 1
 1.0000 Hello from node 0
 2.0086 Hello from node 0
   ...
 9.0712 Hello from node 0
10.0804 Hello from node 0

If the output file were large enough so that node 0 started before node 1 finished, the output of the 
two nodes would be interleaved in the middle of the file. 


M_SYNC Output 

In mode M_SYNC (2), the nodes share a common file pointer, and there is synchronization. Nodes
1 and 0 finish at around the same time. Because node 1 waits for node 0 on each write, the writes are 
interleaved within the file. 

 1.0000 Hello from node 0
 0.0000 Hello from node 1
 2.0278 Hello from node 0
 1.1105 Hello from node 1
   ...
 9.2262 Hello from node 0
 8.1641 Hello from node 1
10.2535 Hello from node 0
 9.1914 Hello from node 1

Note that node 0’s records appear earlier in the file than node 1’s, but the time value shown for each
record from node 0 is later than for the corresponding record from node 1. This is because the value
shown is the time at which cwrite() was called, but node 1’s record was not actually written to the
file until node 0 had written its record.

In this case, node 1 called cwrite() for the first time immediately after opening the file, at time 0, but
the cwrite() blocked and the record was not written to the file until after node 0 called cwrite() for
the first time, at time 1.0000 (1.0000 seconds after the file was opened). Node 1 then called cwrite()
for the second time, at time 1.1105, but that cwrite() again blocked until after node 0 called cwrite()
again at time 2.0278, and so on.


M_RECORD Output 

In mode M_RECORD (3), the nodes access the file in round-robin fashion, but there is no lock-step 
synchronization. Node 1 finishes first. Then, node 0 goes into the file and fills in its data in the 
correct places. Because the records are of a fixed length, node 0 has no trouble doing this. The result 
is that the records are in the same order as in mode M_SYNC, but node 1 did not spend any time 
waiting for node 0. 

 1.0000 Hello from node 0
 0.0000 Hello from node 1
 2.0208 Hello from node 0
 0.0505 Hello from node 1
   ...
 9.1637 Hello from node 0
 0.1955 Hello from node 1
10.1841 Hello from node 0
 0.2158 Hello from node 1

Note that node 1 finished in only 0.2158 seconds, without having to wait for node 0.


M_GLOBAL Output

In mode M_GLOBAL (4), writes by all nodes but one (node 0 in this case) are ignored. As a result, 
the file contains only the writes from that node. 

 1.0000 Hello from node 0
 2.0087 Hello from node 0
   ...
 9.0711 Hello from node 0
10.0797 Hello from node 0

This output is the same as the output of M_UNIX, but the other nodes do not compete with node 0 
for access to the disk, so this mode is more efficient. However, because this program uses such a 
small data file, the difference in execution time is probably not noticeable. 

Note that M_GLOBAL is usually used for reading, not writing.


Reading and Writing Files in Parallel

You can read and write files with the familiar OSF/1 system calls and Fortran routines. For example, 
here is a Fortran code fragment that opens a file whose pathname is /pfs/mydat and reads some data
into an array called array using the Fortran read statement:

      open(unit=10, file='/pfs/mydat', form='unformatted')
      read(10) (array(j), j=1, n)

In addition to the usual I/O facilities, the Paragon OSF/1 operating system offers a series of parallel 
I/O calls, which are discussed in the following pages. These calls can be used on files in both PFS 
and non-PFS file systems. 

Like the message-passing calls, the parallel I/O calls offer you the choice of synchronous or 
asynchronous I/O. The synchronous calls begin with c (for “complete”) and do not return until the 
operation is complete. The asynchronous calls begin with i (for “incomplete”) and return 
immediately; you use the call iodone() or iowait() to determine when the operation is complete.

If you program in Fortran, you should use the parallel I/O calls rather than Fortran I/O whenever you 
can. These calls offer better performance than the Fortran I/O routines, and you can test for the end 
of a file with iseof(). (This does not apply to C programmers; the usual C I/O calls are as efficient as
their parallel I/O counterparts.) However, if you use parallel I/O calls on a file, you must not use
Fortran file I/O statements on the same file (for example, you must not mix write and cwrite() on
the same file). 


NOTE 

Parallel I/O to NFS files may give poor performance or unexpected 
results. 


The Intel supercomputer’s disk I/O hardware and software are designed to support simultaneous 
access by large numbers of nodes. However, a remote NFS server may not be configured to support 
this level of access. If you perform large parallel I/O operations from large numbers of nodes to a 
file that is NFS-mounted from another computer, you may overload the network or the NFS server, 
resulting in poor performance or unexpected results. 



Paragon™ User’s Guide 


Using Parallel File I/O 


Synchronous File I/O

Synopsis                          Description

cread(fileID, buffer, nbytes)     Read from a file, waiting for completion.

cwrite(fileID, buffer, nbytes)    Write to a file, waiting for completion.

creadv(fileID, iov, iovcnt)       Read from a file to irregularly-scattered buffers,
                                  waiting for completion.

cwritev(fileID, iov, iovcnt)      Write to a file from irregularly-scattered buffers,
                                  waiting for completion.


The calls cread(), cwrite(), creadv(), and cwritev() perform synchronous file I/O. They are
equivalent to the standard OSF/1 calls read(), write(), readv(), and writev(), except that they follow
the same naming and error-handling conventions as the Paragon OSF/1 message-passing calls (see 
“Names of Send and Receive Calls” on page 3-7 for information on the Paragon OSF/1 system call 
naming conventions; see “Handling Errors” on page 4-42 for information on the Paragon OSF/1 
error-handling conventions). Unlike their standard OSF/1 equivalents, these calls are available to 
Fortran programs (as well as C). 

For example, here is a C code fragment that writes the message “Hello from node x” to the file
/pfs/hello:

fd = open("/pfs/hello", O_RDWR, 0644);

sprintf(buffer, "Hello from node %d\n", iam);
cwrite(fd, buffer, strlen(buffer));

Here is a slightly more complicated example: a Fortran code fragment that opens a file whose
pathname is /pfs/mydat, seeks to a location, and reads some data using the synchronous call cread().
The data represents a matrix stored in rows of n four-byte elements. Each node reads m rows and
performs a calculation with each row (calling the Basic Linear Algebra Subroutines routine sdot()
to get the dot product of two vectors). Because each node seeks to a different place in the file, you
must use I/O mode M_UNIX (the default).

      open(unit=10, file='/pfs/mydat', form='unformatted')
      newpos = lseek(10, 4*mynode()*n*m, 0)

      do 10 i = 1, m
         call cread(10, arow, n*4)
         y(i) = sdot(n, arow, 1, xtotal, 1)
 10   continue

Note that when you open a file in Fortran, you must open it as sequential and unformatted to be able
to use cread() and cwrite(). (Sequential is the default access, but you must specify
form='unformatted'.)


NOTE 

Unlike their OSF/1 equivalents, these calls do not return the 
number of bytes read or written. If any error occurs, these calls 
print an error message and terminate the calling process. 


Reading past the end of a file is considered an error, so you must be certain you know how many 
bytes remain in the file before you read from it. You can use iseof() to detect end-of-file after each
cread() or creadv(). You can also use the following call to determine the length of a file:

length = lseek(unit, 0, SEEK_END) 

This call sets the file pointer to the end of the file and returns the current position of the file pointer 
(that is, the file’s length). You can then use lseek(unit, 0, SEEK_SET) to return the file pointer to 
the beginning of the file. (If the file might be larger than 2G - 1 bytes, use eseek() instead of lseek(); 
see “Manipulating Extended Files” on page 5-36 for more information.)

If you need to detect errors in reading and writing, you must program in C and use either the standard
OSF/1 calls (read(), write(), readv(), and writev(), described in the OSF/1 Programmer’s
Reference) or the underscore versions of the parallel I/O calls (_cread(), _cwrite(), _creadv(), and
_cwritev(), described under “Handling Errors” on page 4-42). The underscore versions do return the
number of bytes read or written.


Asynchronous File I/O


Synopsis                          Description

iread(fileID, buffer, nbytes)     Read from a file without waiting for completion.

iwrite(fileID, buffer, nbytes)    Write to a file without waiting for completion.

ireadv(fileID, iov, iovcnt)       Read from a file to irregularly-scattered buffers,
                                  without waiting for completion.

iwritev(fileID, iov, iovcnt)      Write to a file from irregularly-scattered buffers,
                                  without waiting for completion.

iodone(id)                        Determine whether an asynchronous I/O
                                  operation is complete. If complete, release the
                                  I/O ID.

iowait(id)                        Wait for completion of an asynchronous I/O
                                  operation and release the I/O ID.


The calls iread(), iwrite(), ireadv(), and iwritev() perform asynchronous file I/O. They work like
cread(), cwrite(), creadv(), and cwritev(), but they return immediately, without waiting for the read
or write to complete. The asynchronous I/O calls return an I/O ID much like the message ID returned
by the asynchronous message passing calls. You can pass this I/O ID to iodone() or iowait() to
determine when the asynchronous file I/O operation has completed.


NOTE 

The number of I/O IDs is limited, so you must use iodone() or
iowait() to release each ID after you use it.


To check if an asynchronous I/O operation has completed, use the iodone() call. It returns 1 if the
asynchronous operation has completed and 0 otherwise. You can also decide to block on the
completion of an asynchronous call. Use the iowait() call for this. Both iodone() and iowait() take
the I/O ID as an input parameter. For example (in Fortran):

c Write to a file 

ioid = iwrite(12, sbuf, size) 


c Do some calculation... 


c Wait until the write completes 
call iowait(ioid) 

The number of available I/O IDs is limited; be sure to release IDs that are no longer needed. There 
are two ways to release an I/O ID: you can issue an iowait(), as shown in the previous example, or
you can keep calling iodone() until it returns 1.

NOTE 

To preserve data integrity, all I/O requests that use or affect the file 
pointer are processed on a “first-in, first-out” basis. 


This means that if an asynchronous I/O call is followed by a synchronous read, write, or seek on the 
same file, the synchronous call will block until the asynchronous operation has completed. 


Closing Files in Parallel 

It’s always a good idea to close a file when you are finished using it. Whether you used open() or 
gopen() to open a file, and whether the file is a PFS file or a non-PFS file, you use the standard 
OSF/1 system calls or Fortran routines to close it. 

For example, to close the file open on file descriptor fd (C) or unit 10 (Fortran): 

/* C version */ 
close(fd); 

c Fortran version 

close(unit=10)


NOTE

If the I/O mode of the file being closed is anything other than 
M_UNIX, closing the file is a synchronizing operation. 

See “Using I/O Modes” on page 5-13 for more information. 

Detecting End-of-File and Moving the File Pointer 

Synopsis                          Description

iseof(fileID)                     Test for end-of-file.

lseek(fileID, offset, whence)     Move the read/write file pointer.


The calls iseof() and lseek() are provided for both C and Fortran programmers. If you use parallel
I/O calls to perform file I/O in a Fortran program, you must use iseof() and lseek() instead of the
equivalent Fortran features.

The iseof() call returns 1 if the file pointer is at the end of the given file and 0 otherwise. For example, the
following Fortran code reads characters from the file open on unit 12, writing each one to the screen, 
until it reaches the end of the file: 

      do while(iseof(12) .eq. 0)
         call cread(12, char, 1)
         print 300, iam, char
 300     format('Node ', i3, ' read: ', a1)
      end do

The lseek() call moves the file pointer to offset bytes from the point specified by whence, which can
be either a name or a number:

• If whence is SEEK_SET, lseek() moves the pointer to offset bytes from the beginning of the
file.

• If whence is SEEK_CUR, lseek() moves the pointer forward offset bytes from its current
position.

• If whence is SEEK_END, lseek() moves the pointer to offset bytes after the end of the file.

The names SEEK_SET, SEEK_CUR, and SEEK_END are constants defined in the header files
unistd.h (for C) and fnx.h (for Fortran). For compatibility with the iPSC system, the numeric values
0, 1, and 2 are also accepted (but using the symbolic names is recommended).

lseek() returns the new position of the file pointer (measured in bytes from the beginning of the file).

For example, the following C call moves the file pointer of the file open on file descriptor fd to the 
beginning of the file: 

#include <unistd.h> 

newpos = lseek(fd, 0, SEEK_SET); 

The following Fortran call moves the file pointer of the file open on unit 12 forward 500 bytes: 

include 'fnx.h'

newpos = lseek(12, 500, SEEK_CUR) 

NOTE 

If the I/O mode of the file is M_SYNC, M_RECORD, or
M_GLOBAL, seeking is a synchronizing operation. 

See “Using I/O Modes” on page 5-13 for more information. 

Flushing Fortran Buffered I/O 


Synopsis                          Description

forceflush()                      Cause all buffered I/O to be flushed if an
                                  exception occurs.

forflush(unit)                    Flush all buffered I/O on a particular unit.


The subroutines forceflush() and forflush() let Fortran programmers make sure that buffered I/O
actually goes to the associated file or device. These subroutines are not available to C programs. 

Fortran I/O to files and devices other than the user’s terminal is buffered: when you write to
a file, the data is stored in a memory buffer, and only written to the corresponding file or device when
the buffer is full. However, if another node is waiting for some data to appear in a file, you might
want to force the contents of a unit’s buffer to be written immediately. You can do this by calling
forflush() on the unit. For example, to flush all buffered I/O on unit 9 to the corresponding file or
device:


call forflush(9) 

Another possible problem with buffered I/O is that if the program is interrupted by an exception, 
buffered data that has not yet been written to the file is lost. The subroutine forceflush() establishes 
a signal handler that flushes all buffered I/O in case of an exception. You call it as follows:

call forceflush

Note that you must call forceflush() before the exception occurs. You can use fpsetmask()
(described under “Controlling Floating-Point Behavior” on page 4-46) to control whether or not an 
exception occurs in case of certain floating-point errors. 

Fortran I/O to the user’s terminal is not buffered. You can avoid buffering to files and devices by 
using parallel file I/O calls such as cwrite() and iwrite() instead of Fortran I/O. These calls do not
buffer I/O into the Fortran I/O memory buffer; when the call returns, you can be sure the data has
been sent to the specified file or device. (However, there may be some buffering within the operating 
system, which cannot be avoided.) 


Using “###” Filenames 

If you perform certain standard file operations (shown in Table 5-1) on a file that contains three or 
more # symbols in its filename, the series of # symbols is automatically replaced by the node number 
(within the application) of the node that opens the file. 


Table 5-1. File Operations that Accept “###” Filenames

access()        mkdir()         stat()
chdir()         mknod()         statfs()
chmod()         open()          symlink()
chown()         readlink()      truncate()
creat()         rename()        unlink()
link()          rmdir()         utimes()


The Fortran equivalents of these operations also support “###” filenames. Note, though, that 
gopen() does not appear in this list. 

For example, assume that you have the same program running on all the nodes of your application,
and each node calls open() to open a file called file###. The result is that each node opens a separate
file. Node 0 opens file000, node 1 opens file001, node 2 opens file002, and so on. If an application
opens file### for reading, the specified files (file000, file001, file002, and so on) must exist.

If you use a “###” filename in a non-parallel program running in the service partition, it uses node
number 0. For example, opening a file called file### from a service node opens file000. Note that
this also affects standard commands that make these calls; for example, since the rm command calls
unlink(), the command rm file### will attempt to remove the file file000.

Filenames containing a sequence of one or two # symbols are not affected. For example, the file 
file## is a single file that is accessible by each node. 

If the number of digits in a node number is less than the number of # symbols in the filename, the 
node number is padded with zeros to the length of the sequence of # symbols. If the number of digits 
in a node number exceeds the number of # symbols in the filename, the filename is extended, but 
only when necessary. For example, calling unlink() on the file data.### in every node of an
application running on 2000 nodes unlinks files data.000, data.001, data.002 ... data.998, data.999,
data.1000, data.1001 ... data.1998, and data.1999.

There is nothing special about files created in this way; each file created is a single ordinary file. For 
example, suppose an application uses open() or creat() to create ###myfile, writes into it, and then
closes the file. This creates a series of files called 000myfile, 001myfile, 002myfile, and so on. Each
of these files is an ordinary file; for example, you can delete one without affecting the others, and 
there’s nothing to prevent node 1 from opening 005myfile. 


Increasing the Size of a File 


Synopsis                          Description

lsize(fileID, offset, whence)     Increase size of a file.


You can allocate more space to a file with lsize(). The lsize() call sets the file’s size as specified by
offset and whence:

• If whence is SIZE_SET, lsize() sets the file’s size to offset bytes.

• If whence is SIZE_CUR, lsize() sets the file’s size to the current file pointer position plus offset
bytes.

• If whence is SIZE_END, lsize() increases the file’s size by offset bytes.

The names SIZE_SET, SIZE_CUR, and SIZE_END are constants defined in the header files nx.h
(for C) and fnx.h (for Fortran). For compatibility with the iPSC system, the numeric values 0, 1, and
2 are also accepted (but using the symbolic names is recommended).

For example, the following Fortran call increases the size of the file open on unit1 to one million
bytes:

      include 'fnx.h'

      size = lsize(unit1, 1000000, SIZE_SET)

The following C call sets the size of the file open on file descriptor fd to 500,000 bytes past the
current position of the file pointer:

#include <unistd.h>
#include <nx.h>

int size, fd;

size = lsize(fd, 500000, SIZE_CUR);

The additional space is allocated to the file from the file system, but it is not initialized (its contents 
are undefined). 

lsize() will not decrease the size of a file. If the size specified by offset and whence is smaller than 
the file’s current size, the call has no effect. 
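
The three whence rules and the no-decrease rule can be summarized as a small model. This is a sketch for illustration only: model_lsize() is a hypothetical function that computes the resulting size without touching any file, and the MODEL_ constants assume that the numeric values 0, 1, and 2 correspond to SIZE_SET, SIZE_CUR, and SIZE_END in that order.

```c
/* Assumed mapping of the numeric values 0, 1, and 2 (illustration only). */
enum { MODEL_SIZE_SET = 0, MODEL_SIZE_CUR = 1, MODEL_SIZE_END = 2 };

/* Hypothetical model of the size that results from an lsize() request.
 * cur_size is the file's current size and file_pos is the current file
 * pointer position. Returns -1 for an unknown whence value. */
long model_lsize(long cur_size, long file_pos, long offset, int whence)
{
    long target;

    switch (whence) {
    case MODEL_SIZE_SET: target = offset;            break;
    case MODEL_SIZE_CUR: target = file_pos + offset; break;
    case MODEL_SIZE_END: target = cur_size + offset; break;
    default:             return -1;
    }
    /* The call never shrinks a file: a smaller target has no effect. */
    return target > cur_size ? target : cur_size;
}
```

For a 100-byte file whose file pointer is at byte 40, a SIZE_CUR request with an offset of 10 leaves the size at 100, because the requested size (50) is smaller than the current size.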

The major use of this call is to ensure that enough disk space is available before you begin a lengthy 
calculation. Pre-allocating disk space can also improve disk performance. 


Using Extended Files 

A PFS file greater than or equal to 2G bytes in size is called an extended file. These files are stored 
in the same way as non-extended PFS files. However, some of the file parameters (like the file 
pointer and file size) do not fit into a 32-bit integer. This means that standard OSF/1 calls and 
commands that use these parameters cannot be used on extended files. The following two sections 
list the calls and commands that do not support extended files. 
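
The 2G boundary follows directly from the range of a signed 32-bit integer. The sketch below uses standard C fixed-width types only; the function names are hypothetical and are not part of Paragon OSF/1.

```c
#include <stdint.h>

#define TWO_G ((int64_t)1 << 31)   /* 2G bytes = 2^31 = 2147483648 */

/* A file is "extended" when its size is 2G bytes or more. */
int is_extended_size(int64_t size)
{
    return size >= TWO_G;
}

/* An offset can be passed to a call with a 32-bit parameter only if it
 * lies within the signed 32-bit range (at most 2G - 1). */
int fits_in_32bit_offset(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

A size of exactly 2G - 1 bytes is the largest that still fits; one byte more makes the file an extended file.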

OSF/1 Calls that Do Not Support Extended Files 


Most OSF/1 calls, such as read() and write(), don’t care how big the file is and work perfectly well 
on extended files. The OSF/1 calls that have problems with extended files are shown in Table 5-2. 

Table 5-2. OSF/1 Calls Not Supporting Extended Files 

Call                Problem 

fcntl()             Can’t lock a file region larger than 2G - 1 bytes. 
fgetpos()           Can’t return an offset greater than 2G - 1 bytes. 
fseek()             Can’t specify an offset greater than 2G - 1 bytes. 
unlocked_fseek()    Can’t specify an offset greater than 2G - 1 bytes. 
fsetpos()           Can’t return an offset greater than 2G - 1 bytes. 
fstat()             Can’t be used on a file larger than 2G - 1 bytes. (1) 
ftell()             Can’t return an offset greater than 2G - 1 bytes. 
ftruncate()         Can’t specify a file size greater than 2G - 1 bytes. 
lseek()             Can’t specify an offset greater than 2G - 1 bytes. 
lstat()             Can’t be used on a file larger than 2G - 1 bytes. (1) 
madvise()           Can’t map a file larger than 2G - 1 bytes. 
mmap()              Can’t map a file larger than 2G - 1 bytes. 
mprotect()          Can’t map a file larger than 2G - 1 bytes. 
msync()             Can’t map a file larger than 2G - 1 bytes. 
munmap()            Can’t map a file larger than 2G - 1 bytes. 
stat()              Can’t be used on a file larger than 2G - 1 bytes. (1) 
truncate()          Can’t specify a file size greater than 2G - 1 bytes. 

1. If you call fstat(), lstat(), or stat() on a file larger than 2G - 1 bytes, the call fails with 
the error EFBIG. 


To manipulate extended files, Paragon OSF/1 provides special calls that perform the equivalent of 
lseek(), stat(), fstat(), and lsize() for extended files. These are discussed under “Manipulating 
Extended Files” on page 5-36. 

OSF/1 Commands that Do Not Support Extended Files 

Many OSF/1 commands make one or more of the system calls in Table 5-2, and so do not work on 
extended files. The commands chgrp, chmod, chown, cp, ls, mv, tar, and rm have been specially 
modified to support extended files; most other commands will fail if used on extended files. (Note 
that you must use the -E switch to archive an extended file with tar; see tar in the Paragon™ 
Commands Reference Manual for more information.) 

Table 5-3 shows the OSF/1 commands that are known to have problems with extended files. (This 
list is not guaranteed to be complete; other commands, not listed here, may also have problems.) 


Table 5-3. OSF/1 Commands Not Supporting Extended Files 

Command     Problem 

cat         Can’t read a file larger than 2G - 1 bytes. 
compress    Can’t compress a file larger than 2G - 1 bytes. 
cpio        Can’t handle files or archives larger than 2G - 1 bytes. 
diff        Can’t compare a file larger than 2G - 1 bytes. 
du          Can’t show the size of a directory containing files larger than 2G - 1 bytes. 
ed          Can’t edit a file larger than 2G - 1 bytes. 
ex          Can’t edit a file larger than 2G - 1 bytes. 
find        Can’t find a file larger than 2G - 1 bytes with -size; 
            can’t show the size of a file larger than 2G - 1 bytes with -ls. 
ftp         Can’t copy a file larger than 2G - 1 bytes. 
more        Can’t display a file larger than 2G - 1 bytes. 
newfs       Can’t create a UFS file system larger than 2G - 1 bytes. 
rcp         Can’t copy a file larger than 2G - 1 bytes. (1) 
tail        Can’t display a file larger than 2G - 1 bytes. 
vi          Can’t edit a file larger than 2G - 1 bytes. 

1. Note that although rcp cannot copy an extended file, cp can. 

Manipulating Extended Files 


Synopsis                                            Description 

eseek(fildes, offset, whence)           (C)         Move file pointer in extended file. 
eseek(unit, offset, whence, newpos)     (Fortran) 

esize(fildes, offset, whence)           (C)         Increase size of extended file. 
esize(unit, offset, whence, newsize)    (Fortran) 

estat(path, buffer)                     (C only)    Get status of extended file from pathname. 

lestat(path, buffer)                    (C only)    Get status of extended file or symbolic link 
                                                    from pathname. 

festat(fildes, buffer)                  (C only)    Get status of open extended file from file 
                                                    descriptor. 


The e...() calls perform file operations on extended files. They do this by having parameters that are 
extended integers (a data type capable of representing integers greater than 2G - 1). You must use 
the calls described under “Performing Extended Arithmetic” on page 5-37 to operate on extended 
integers. 

• The call eseek() is like lseek() (discussed under “Detecting End-of-File and Moving the File 
Pointer” on page 5-29), except that the offset parameter is an extended integer. The C version 
of this call is a function that returns the new position as an extended integer; the Fortran version 
is a subroutine that stores the new position in its fourth parameter. 

• The call esize() is like lsize() (discussed under “Increasing the Size of a File” on page 5-32), 
except that the offset parameter is an extended integer. The C version of this call is a function 
that returns the new size as an extended integer; the Fortran version is a subroutine that stores 
the new size in its fourth parameter. 

• The calls estat(), lestat(), and festat() are like the standard OSF/1 calls stat(), lstat(), and fstat() 
(described in the OSF/1 Programmer’s Reference), except that they use a structure called estat, 
defined in <sys/estat.h>, which is the same as the OSF/1 stat structure except that the file size 
is an extended integer. These calls are available only in C, not in Fortran. 

You must use these calls to manipulate extended files (PFS files greater than or equal to 2G bytes in 
size). You can also use them on non-PFS files and on PFS files less than 2G bytes in size; for PFS 
files less than 2G bytes in size, the standard OSF/1 calls work as well. 

Performing Extended Arithmetic 


Synopsis                             Description 

eadd(e1, e2)             (C)         Add two extended integers. 
eadd(e1, e2, eresult)    (Fortran) 

ecmp(e1, e2)                         Compare two extended integers. 

ediv(e, n)               (C)         Divide extended integer by integer. 
ediv(e, n, result)       (Fortran) 

emod(e, n)               (C)         Give extended integer modulo an integer 
emod(e, n, result)       (Fortran)   (remainder when e is divided by n). 

emul(e, n)               (C)         Multiply extended integer by integer. 
emul(e, n, eresult)      (Fortran) 

esub(e1, e2)             (C)         Subtract two extended integers. 
esub(e1, e2, eresult)    (Fortran) 

etos(e, s)                           Convert extended integer to string. 

stoe(s)                  (C)         Convert string to extended integer. 
stoe(s, e)               (Fortran) 


The extended arithmetic calls manipulate 64-bit integers, also called extended integers. You use 
these calls to manipulate the parameters used by the parallel I/O calls described in the previous 
section. 

Extended integers are signed 64-bit integers with values from -2^63 to 2^63 - 1 (2^63 is approximately 
9.2 x 10^18). 

• In Fortran, extended integers are stored in a two-element array of type integer*4. 

• In C, extended integers are stored in a variable of type esize_t, a structure type defined in the 
header file <sys/estat.h>. (For compatibility with the iPSC system, there is also a header file 
<estat.h> that simply includes <sys/estat.h>.) 

You should always use extended arithmetic calls to operate on an extended integer, rather than 
access its internal structure. 




Some of these calls return extended integers. The C versions of these calls return a value of type 
esize_t. However, Fortran does not allow functions to return arrays, so the Fortran versions of these 
calls are subroutines with an additional parameter: the result of the operation on the first two 
parameters is stored into the third parameter. For example, the following call adds the extended 
integers e1 and e2 and stores the result in e_sum: 

/* C version */ 

#include <sys/estat.h> 

esize_t e1, e2, e_sum; 

e_sum = eadd(e1, e2); 


c Fortran version 

      integer e1(2), e2(2), e_sum(2) 
      call eadd(e1, e2, e_sum) 

If you want to add an ordinary integer to an extended integer, you must create your own extended 
integer from the desired integer value. To create an extended integer, use stoe(). This call takes a 
string whose value is a number, and returns the corresponding numeric value as an extended integer. 
For example, the following code fragment adds 1 to the value of the extended integer e1. It does this 
by converting the string "1" to an extended integer with stoe(), storing the resulting extended integer 
in e2, and then adding e2 to e1 (note that in Fortran the string must be declared to be one character 
larger than the actual string being converted): 

/* C version */ 

#include <sys/estat.h> 

esize_t e1, e2, e_sum; 
char *one = "1"; 

e2 = stoe(one); 
e_sum = eadd(e1, e2); 


c Fortran version 

      character*2 one 
      parameter (one = '1') 
      integer e1(2), e2(2), e_sum(2) 

      call stoe(one, e2) 
      call eadd(e1, e2, e_sum) 
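
On a system without the Paragon libraries, the effect of the conversion and addition calls can be approximated with standard 64-bit arithmetic. The sketch below is only a portable model: the type model_esize and the model_ functions are hypothetical stand-ins, not the real esize_t, stoe(), etos(), or eadd(), and model_etos() takes an extra buffer-size argument that the real etos() does not have.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int64_t model_esize;   /* stand-in for esize_t, NOT the Paragon type */

/* Model of stoe(): convert a decimal string to a 64-bit value. */
model_esize model_stoe(const char *s)
{
    return (model_esize)strtoll(s, NULL, 10);
}

/* Model of etos(): convert a 64-bit value to a decimal string. */
void model_etos(model_esize e, char *s, size_t size)
{
    snprintf(s, size, "%lld", (long long)e);
}

/* Model of eadd(): add two 64-bit values. */
model_esize model_eadd(model_esize e1, model_esize e2)
{
    return e1 + e2;
}
```

With this model, adding the converted strings "4294967295" and "1" gives 4294967296, a value that does not fit in a 4-byte integer.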

The other extended arithmetic calls allow you to subtract, multiply, divide, and find the remainder 
after division of extended integers. When you use ediv() or emod(), the divisor and answer must be 
4-byte integers, not extended integers. Similarly, when you use emul(), the second argument must 
be a 4-byte integer, not an extended integer. 

You can also compare two extended integers; ecmp() returns -1, 0, or 1, depending on whether the 
first extended integer is less than, equal to, or greater than the second. 
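
The operand widths described above can be illustrated with standard C integer types. This is a sketch under the assumption that an extended integer behaves like int64_t and a 4-byte integer like int32_t; the model_ functions are hypothetical and are not the Paragon calls themselves. Note that in this model a quotient too large for 32 bits is simply truncated by the cast.

```c
#include <stdint.h>

/* Model of ecmp(): -1, 0, or 1 for less than, equal to, or greater than. */
int model_ecmp(int64_t e1, int64_t e2)
{
    return (e1 > e2) - (e1 < e2);
}

/* Model of ediv(): 64-bit dividend, 4-byte divisor, 4-byte answer. */
int32_t model_ediv(int64_t e, int32_t n)
{
    return (int32_t)(e / n);
}

/* Model of emod(): remainder when e is divided by the 4-byte integer n. */
int32_t model_emod(int64_t e, int32_t n)
{
    return (int32_t)(e % n);
}

/* Model of emul(): 64-bit value times a 4-byte integer, 64-bit result. */
int64_t model_emul(int64_t e, int32_t n)
{
    return e * n;
}
```

For example, dividing a file size of 10,000,000,007 bytes by a record size of 1,000,000 gives 10000 whole records with a remainder of 7 bytes.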

Getting Information About PFS File Systems 


Synopsis                                               Description 

getpfsinfo(buf)                                        Get PFS-specific information about all mounted 
                                                       PFS file systems. 

statpfs(path, fs_buffer, pfs_buffer, pfs_bufsize)      Get PFS-specific and non-PFS-specific 
                                                       information for the file system containing path. 

fstatpfs(fildes, fs_buffer, pfs_buffer, pfs_bufsize)   Get PFS-specific and non-PFS-specific 
                                                       information for the file system containing the file 
                                                       open on fildes. 


The functions getpfsinfo(), statpfs(), and fstatpfs() let C programmers get information about PFS 
file systems. These functions are not available to Fortran programs. See “PFS File Systems and PFS 
Files” on page 5-3 for more information on the concepts discussed in this section. 


Getting Information About All Mounted PFS File Systems 

getpfsinfo() gets information about all mounted PFS file systems. It is similar to the standard OSF/1 
call getmntinfo(), except that instead of returning information in an array of statfs structures, it 
returns information in an array of pfsmntinfo structures. It allocates the memory for this array of 
structures, each of which describes one PFS file system, and stores a pointer to this array into its 
argument. getpfsinfo() returns the number of elements in this array. The pfsmntinfo structure, 
defined in the header file <pfs/pfs.h>, contains the following fields: 

m_mntonname     Directory on which the PFS file system is mounted. 

m_statpfs       statpfs structure that describes the PFS file system. 

The statpfs structure, also defined in the header file <pfs/pfs.h>, describes the PFS-specific 
attributes of a file system. This is a variable-size structure. It contains the following fields: 

p_reclen        Total size of this statpfs structure, in bytes. 

p_sunitsize     Stripe unit size for this PFS file system, in bytes. 

p_sfactor       Number of stripe directories within this PFS file system. 

p_sdirs         List of stripe directories within this PFS file system. The number of 
                pathnames in the list is specified by p_sfactor. 

Each pathname in p_sdirs is a structure of type pathname_t (defined in <pfs/pfs.h>); you can use 
the NEXTPATH() macro defined in <pfs/pfs.h> to examine each pathname in turn. 

Here’s an example of getpfsinfo(): 

#include <stdio.h> 
#include <sys/types.h> 
#include <nx.h> 
#include <pfs/pfs.h> 

main() { 
    struct pfsmntinfo *pfsinfo; 
    struct statpfs *sattr; 
    pathname_t *sdir; 
    int cnt, i, j, incr; 

    cnt = getpfsinfo(&pfsinfo); 

    if (cnt == 0) { 
        printf("No PFS file systems mounted\n"); 
    } else { 
        for (i = 0; i < cnt; i++) { 
            printf("Mount point: %s\n", pfsinfo->m_mntonname); 

            sattr = &(pfsinfo->m_statpfs); 
            printf(" Stripe unit size: %d\n", sattr->p_sunitsize); 
            printf(" Stripe factor: %d\n", sattr->p_sfactor); 

            sdir = &(sattr->p_sdirs); 
            printf(" Stripe directories:\n"); 
            /* Separate index so the outer loop counter is not overwritten. */ 
            for (j = 0; j < sattr->p_sfactor; j++) { 
                printf("  %s\n", sdir->name); 
                sdir = NEXTPATH(sdir); 
            } 

            /* Step to the next variable-size pfsmntinfo structure. */ 
            incr = sizeof(pfsinfo->m_mntonname) + sattr->p_reclen; 
            pfsinfo = (struct pfsmntinfo *)((char *)pfsinfo + incr); 
        } 
    } 
} 

This program prints out the attributes of all mounted PFS file systems, something like the command 
showfs -t pfs. Note that you must use the NEXTPATH() macro to step through the p_sdirs field of 
the statpfs structure, and you must increment the pointer into the array of pfsmntinfo structures by 
the size of the current pfsmntinfo structure (using the value of the p_reclen field of its statpfs 
structure). 
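
The pointer stepping described here is a general technique for walking a packed array of variable-size records. The sketch below shows the same idea on a plain byte buffer so that it can be tried anywhere; the record layout (a fixed 16-byte name followed by a body whose first field holds the body's length) is invented for illustration and is not the real pfsmntinfo layout.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_NAME_LEN 16   /* hypothetical fixed-size name field */

/* Count the records in a packed buffer. Each record is a fixed-size name
 * followed by a body; the first field of the body gives the body's total
 * length in bytes (the role p_reclen plays in the text above). */
size_t count_records(const unsigned char *buf, size_t total)
{
    size_t off = 0, n = 0;

    while (off + MODEL_NAME_LEN + sizeof(unsigned int) <= total) {
        unsigned int reclen;

        /* memcpy avoids unaligned access into the byte buffer. */
        memcpy(&reclen, buf + off + MODEL_NAME_LEN, sizeof reclen);
        if (reclen < sizeof reclen || off + MODEL_NAME_LEN + reclen > total)
            break;                        /* malformed or truncated record */

        off += MODEL_NAME_LEN + reclen;   /* step by this record's own size */
        n++;
    }
    return n;
}
```

The key point is that the increment is recomputed for every record, because no two records are guaranteed to have the same size.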

Getting PFS Information About a Single File System 

statpfs() gets information about a file system given the pathname of a file or directory in that file 
system; fstatpfs() gets information about a file system given the file descriptor of an open file in that 
file system. 

These functions get both general and PFS-specific information about the specified file system. They 
can be used on both PFS and non-PFS file systems, but they return PFS-specific information only 
for PFS file systems. They are similar to the standard OSF/1 calls statfs() and fstatfs(), except that 
instead of returning information in a statfs structure, they return information in an estatfs structure 
and a statpfs structure. 

• The estatfs structure, defined in the header file <pfs/pfs.h>, describes the basic attributes of the 
file system. It is just like the statfs structure defined in <sys/mount.h>, except that some of its 
fields are of type esize_t (see “Performing Extended Arithmetic” on page 5-37 for information 
on this type). This is necessary because some of the values returned for PFS file systems are too 
large to be stored into an ordinary integer. 

Some of the more generally useful fields of the estatfs structure are: 

f_type          The type of the file system, expressed as a constant such as 
                MOUNT_UFS, MOUNT_NFS, or MOUNT_PFS (these constants are 
                defined in <sys/mount.h>). 

f_bavail        Number of free 1024-byte disk blocks in the file system available to 
                ordinary users, expressed as a value of type esize_t. 

f_mntonname     Directory on which the file system is mounted, expressed as a string. 

f_mntfromname   Device name of the file system, expressed as a string. 

See statpfs() in the Paragon™ C System Calls Reference Manual for a complete description of 
all fields in the estatfs structure. 

• The statpfs structure is the same statpfs structure described for getpfsinfo() in the previous 
section. However, the way it is returned is different: getpfsinfo() allocates space for several 
statpfs structures and returns you a pointer to this space, but statpfs() and fstatpfs() store 
information in a statpfs structure that you provide. 

Because the statpfs structure is variable-size, you must tell statpfs() and fstatpfs() how big your 
statpfs structure is; you do this with the fourth parameter of statpfs() and fstatpfs() (called 
pfs_bufsize). Then you must check the p_reclen field in the returned statpfs structure to be sure 
the returned information fit in your provided structure; if it didn’t, try again with a larger 
structure. 
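
The check-and-retry step can be expressed as a small reusable pattern. The following sketch is self-contained and does not call statpfs(): model_query() is a hypothetical stand-in that copies as much of a 40-byte record as fits and returns the full length required (the role that p_reclen plays), and query_with_retry() is an illustrative helper, not a Paragon call.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical data source standing in for the file system information. */
static const char model_payload[40] = "stripe-dir-a:stripe-dir-b:stripe-dir-c";

/* Stand-in for a call such as statpfs(): fills buf with as much as fits
 * and returns the size the complete record needs. */
size_t model_query(void *buf, size_t bufsize)
{
    size_t n = bufsize < sizeof model_payload ? bufsize : sizeof model_payload;

    memcpy(buf, model_payload, n);
    return sizeof model_payload;
}

/* Allocate a buffer of initial bytes, query, and retry once with a larger
 * buffer if the record did not fit. The caller frees the result. */
void *query_with_retry(size_t (*query)(void *, size_t), size_t initial,
                       size_t *out_len)
{
    size_t size = initial;
    void *buf = malloc(size);
    size_t need;

    if (buf == NULL)
        return NULL;
    need = query(buf, size);
    if (need > size) {                    /* did not fit: grow and retry */
        void *bigger = realloc(buf, need);

        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        need = query(buf, need);
    }
    *out_len = need;
    return buf;
}
```

Starting with a deliberately small buffer (for example, 8 bytes) exercises the retry path; starting with a generous one avoids the second call in the common case.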

Here’s an example of statpfs(): 

#include <stdio.h> 
#include <stdlib.h> 
#include <sys/types.h> 
#include <sys/mount.h> 
#include <malloc.h> 
#include <nx.h> 
#include <pfs/pfs.h> 

#define SDIRS_INIT_SIZE 1024 

main(int argc, char **argv) { 
    struct statpfs *statpfsbuf; 
    int bufsize; 
    struct estatfs estatbuf; 
    pathname_t *sdir; 
    char blocks[80]; 
    int i; 

    if (argc != 2) { 
        printf("Usage: %s <mountpoint>\n", argv[0]); 
        exit(1); 
    } 

    /* Room for the structure plus an allowance for the pathnames. */ 
    bufsize = sizeof(struct statpfs) + SDIRS_INIT_SIZE; 
    statpfsbuf = (struct statpfs *)malloc(bufsize); 

    if (statpfs(argv[1], &estatbuf, statpfsbuf, bufsize) < 0) { 
        nx_perror("statpfs"); 
        exit(1); 
    } 

    /* If the information did not fit, enlarge the buffer and try again. */ 
    if (statpfsbuf->p_reclen > bufsize) { 
        bufsize = statpfsbuf->p_reclen; 
        statpfsbuf = (struct statpfs *)realloc(statpfsbuf, bufsize); 
        if (statpfs(argv[1], &estatbuf, statpfsbuf, bufsize) < 0) { 
            nx_perror("statpfs"); 
            exit(1); 
        } 
    } 

    printf("Selected PFS statistics for %s:\n", argv[1]); 

    /* From estatfs structure */ 
    printf(" File system type: %d\n", estatbuf.f_type); 
    etos(estatbuf.f_bavail, blocks); 
    printf(" # of 1K blocks available: %s\n", blocks); 
    printf(" Mount point: %s\n", estatbuf.f_mntonname); 
    printf(" Device name: %s\n", estatbuf.f_mntfromname); 

    /* From statpfs structure */ 
    printf(" Stripe unit size: %d\n", statpfsbuf->p_sunitsize); 
    printf(" Stripe factor: %d\n", statpfsbuf->p_sfactor); 
    printf(" Stripe directories:\n"); 
    sdir = &(statpfsbuf->p_sdirs); 
    for (i = 0; i < statpfsbuf->p_sfactor; i++) { 
        printf("  %s\n", sdir->name); 
        sdir = NEXTPATH(sdir); 
    } 
} 

This program prints out the attributes of the file system containing the file specified by its first 
argument. Note that you must allocate enough space for the statpfs structure plus the stripe directory 
pathnames and check the returned p_reclen against the currently-allocated size of the structure 
(bufsize). 

This example starts off by allocating an extra SDIRS_INIT_SIZE bytes (an arbitrary value) for the 
stripe directory pathnames. If p_reclen is larger than the size of the structure, this example uses 
realloc() to enlarge the structure and calls statpfs() again. It then uses the NEXTPATH() macro to 
step through the p_sdirs field of the statpfs structure, as discussed earlier for getpfsinfo(). 

Controlling Tape Devices 


Synopsis                     Description 

ioctl(fd, request, argp)     Perform an operation on an open tape or other 
                             device. 


You can use standard OSF/1 I/O calls or parallel I/O calls to open, read, and write tape devices. To 
control tape devices, use the standard OSF/1 system call ioctl(). The header file <sys/mtio.h> 
defines the tape-specific structures and constants you need. 


NOTE 

Only one node at a time can open a tape device, and it must use 
I/O mode M_UNIX (0). 


<sys/mtio.h> defines two constants you can use as the second argument of ioctl(): 

MTIOCTOP    Perform operation on tape. 

MTIOCGET    Get status of tape. 

The rest of this section explains the details of using these constants. 


Naming Tape Devices 

The Paragon OSF/1 operating system uses the following conventions for naming tape devices: 

/dev/io#/rmt&      Raw cartridge tape, rewinds automatically when closed. 

/dev/io#/nrmt&     Raw cartridge tape, does not rewind automatically when 
                   closed. 

/dev/io#/rmtc&     Raw cartridge tape with compression, rewinds 
                   automatically when closed. 

/dev/io#/nrmtc&    Raw cartridge tape with compression, does not rewind 
                   automatically when closed. 


NOTE 

The rmtc devices can only be used with tape drives that support 
data compression. 


In each case, # is the node number of the I/O node to which the tape device is connected, and & is 
the SCSI ID of the tape device (typically 6). So, for example, to use the cartridge tape device with 
SCSI ID 6 on the boot node (node 0) and have it rewind automatically when closed, use the pathname 
/dev/io0/rmt6. To use the same device but have it not rewind automatically when closed, use the 
pathname /dev/io0/nrmt6. 
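
The naming convention can be captured in a small helper that builds the pathname from its parts. This is a sketch only; tape_device_path() is a hypothetical function, not part of Paragon OSF/1, and it simply applies the four patterns listed above.

```c
#include <stdio.h>

/* Hypothetical helper: build a tape device pathname from the I/O node
 * number, the SCSI ID, and the rewind and compression choices.
 * Returns the length of the pathname. */
int tape_device_path(char *out, size_t outsize, int io_node, int scsi_id,
                     int rewind_on_close, int compress)
{
    return snprintf(out, outsize, "/dev/io%d/%srmt%s%d",
                    io_node,
                    rewind_on_close ? "" : "n",   /* "n" prefix: no rewind */
                    compress ? "c" : "",          /* "c" suffix: compression */
                    scsi_id);
}
```

For I/O node 0 and SCSI ID 6, the helper yields /dev/io0/rmt6 when rewinding is requested and /dev/io0/nrmt6 when it is not.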


Performing Operations on Tape Devices 

When you call ioctl() with MTIOCTOP as its second argument, you must use a structure of type 
mtop as the third argument. The mtop structure is defined as follows: 

struct mtop { 
    short mt_op;      /* operation to perform */ 
    short fill;       /* ignored */ 
    long  mt_count;   /* how many operations to perform */ 
}; 

This structure tells ioctl() what operation to perform. The valid values of the mt_op field include the 
following constants: 

MTWEOF      Write mt_count end-of-file marks. 

MTFSF       Space the tape forward by mt_count files. 

MTBSF       Space the tape backward by mt_count files. 

MTFSR       Space the tape forward by mt_count records. 

MTBSR       Space the tape backward by mt_count records. 

MTREW       Rewind the tape. If the tape has been written to, writes two end-of-file marks 
            before rewinding. (Two end-of-file marks indicate the end of data.) 

MTOFFL      Rewind the tape and put the drive offline. If the tape has been written to, 
            writes two end-of-file marks before rewinding. 

MTNOP       No operation, sets status only. 

MTRETEN     Retension the tape. 

MTERASE     Erase the entire tape. 

MTEOM       Position the tape at end of media (SCSI only). 

Closing the tape device after writing to it also writes an end-of-file mark (or two end-of-file marks 
if the tape was opened in variable-block mode or the tape mode “rewind” is set). If the tape was 
opened in variable-block mode, the tape head is then positioned between the two end-of-file marks, 
so that any subsequent write will overwrite the second one. 

For example, the following C program rewinds the tape on the device connected to /dev/io0/rmt6: 

#include <stdio.h> 
#include <stdlib.h> 
#include <fcntl.h> 
#include <errno.h> 
#include <sys/mtio.h> 

main() { 
    int fd; 
    struct mtop s; 

    fd = open("/dev/io0/rmt6", O_RDONLY, 0666); 
    if (fd == -1) { 
        perror("opening /dev/io0/rmt6"); 
        exit(1); 
    } 

    s.mt_op = MTREW;      /* operation: rewind */ 
    s.mt_count = 1; 

    if (ioctl(fd, MTIOCTOP, &s) == -1) { 
        perror("rewinding tape"); 
        exit(2); 
    } 
} 


Getting Status of Tape Devices 

When you call ioctl() with MTIOCGET as its second argument, you must provide a structure of 
type mtget as the third argument. The mtget structure is defined as follows: 

struct mtget { 
    short mt_type;    /* type of magtape device */ 
    short mt_dsreg;   /* "drive status" register */ 
    short mt_erreg;   /* "error" register */ 
    short mt_resid;   /* residual count */ 
}; 

ioctl() fills in the elements of this structure with information about the device. The value of the 
mt_type field is always 0x0C (indicating a generic SCSI device). The values of the mt_dsreg and 
mt_erreg fields are device-dependent. 

For example, the following C program prints the status of the device connected to /dev/io0/rmt6: 

#include <stdio.h> 
#include <stdlib.h> 
#include <fcntl.h> 
#include <errno.h> 
#include <sys/mtio.h> 

main() { 
    int fd; 
    struct mtget s; 

    fd = open("/dev/io0/rmt6", O_RDONLY, 0666); 
    if (fd == -1) { 
        perror("opening /dev/io0/rmt6"); 
        exit(1); 
    } 

    if (ioctl(fd, MTIOCGET, &s) == -1) { 
        perror("getting status of tape"); 
        exit(2); 
    } 

    printf("mt_type = 0x%x\n", s.mt_type); 
    printf("mt_dsreg = 0x%x\n", s.mt_dsreg); 
    printf("mt_erreg = 0x%x\n", s.mt_erreg); 
    printf("mt_resid = 0x%x\n", s.mt_resid); 
} 

Synchronization Summary 

Table 5-4 lists the I/O modes and summarizes the I/O calls that are synchronizing calls in each one. 
Table 5-5 lists the most commonly-used I/O calls and summarizes the I/O modes that cause them to 
become synchronizing calls. 


Table 5-4. Synchronization in Each I/O Mode 

I/O Mode     I/O Calls that Synchronize 

M_UNIX       gopen() and setiomode() 
M_LOG        gopen(), setiomode(), and close() 
M_SYNC       All 
M_RECORD     gopen(), setiomode(), lseek(), eseek(), and close() 
M_GLOBAL     All 


Table 5-5. File I/O Calls that Synchronize 

Call                       I/O Modes Causing the Call to Synchronize 

close()                    M_LOG, M_SYNC, M_RECORD, and M_GLOBAL 
cread() and creadv()       M_SYNC and M_GLOBAL 
cwrite() and cwritev()     M_SYNC and M_GLOBAL 
eseek()                    M_SYNC, M_RECORD, and M_GLOBAL 
gopen()                    All 
iread() and ireadv()       M_SYNC and M_GLOBAL 
iseof()                    M_SYNC and M_GLOBAL 
iwrite() and iwritev()     M_SYNC and M_GLOBAL 
lseek()                    M_SYNC, M_RECORD, and M_GLOBAL 
setiomode()                All 
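
The two tables describe the same relation from two directions, so they can be encoded as a single lookup. The sketch below is illustrative only: the MODE_ and CALL_ enumerators are hypothetical names chosen for this example and are not the M_ constants from nx.h, and the read and write calls are grouped under one enumerator because they synchronize in the same modes.

```c
/* Hypothetical names for the five I/O modes (not the nx.h constants). */
enum model_mode { MODE_UNIX, MODE_LOG, MODE_SYNC, MODE_RECORD, MODE_GLOBAL };

/* Hypothetical names for the calls listed in the tables. */
enum model_call {
    CALL_GOPEN, CALL_SETIOMODE, CALL_CLOSE, CALL_LSEEK, CALL_ESEEK,
    CALL_READ_WRITE,   /* cread, creadv, cwrite, cwritev, iread, ... */
    CALL_ISEOF
};

/* Return 1 if the call is a synchronizing call in the given mode. */
int call_synchronizes(enum model_call call, enum model_mode mode)
{
    if (mode == MODE_SYNC || mode == MODE_GLOBAL)
        return 1;                              /* every call synchronizes */

    switch (call) {
    case CALL_GOPEN:
    case CALL_SETIOMODE:
        return 1;                              /* synchronize in all modes */
    case CALL_CLOSE:
        return mode == MODE_LOG || mode == MODE_RECORD;
    case CALL_LSEEK:
    case CALL_ESEEK:
        return mode == MODE_RECORD;
    default:
        return 0;                              /* reads, writes, and iseof */
    }
}
```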









































Introduction 

In Paragon OSF/1, each process consists of a set of resources, such as memory objects and open 
files, and one or more threads (short for threads of control). Each thread consists of an instruction 
pointer and a stack. 

By default, each process has only one thread. When there is more than one thread in a single process, 
each thread executes independently, but they share resources. For example, all the threads in a single 
process share memory; when one thread writes to a variable in memory, it modifies the value of that 
variable for all threads. 

Because threads share memory, you must carefully coordinate access to shared areas of memory. For 
example, if two threads write to the same area of memory at the same time, the results may be 
indeterminate. You can use arbitration mechanisms such as mutexes (short for “mutual exclusion”) 
to protect areas of memory from being accessed at the same time. Software that uses these 
techniques and can be safely executed by two or more threads at the same time is referred to as 
thread-safe or reentrant. 
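
The following sketch shows a mutex protecting a shared counter. It is written against the current POSIX threads API as found on present-day systems, which is an assumption: the Draft 4 interface described in this chapter differs in some details, so treat the calls below as illustrative rather than as Paragon OSF/1 code. The function names and the limit of five worker threads are also choices made for this example.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 5   /* illustrative cap on worker threads */

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;    /* shared by all threads in the process */

/* Each worker increments the shared counter iters times. */
static void *worker(void *arg)
{
    long iters = *(const long *)arg;
    long i;

    for (i = 0; i < iters; i++) {
        pthread_mutex_lock(&counter_lock);     /* one thread at a time */
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Start up to MAX_WORKERS threads, wait for them, and return the total. */
long run_workers(int nthreads, long iters)
{
    pthread_t tid[MAX_WORKERS];
    int started = 0, i;

    if (nthreads > MAX_WORKERS)
        nthreads = MAX_WORKERS;
    counter = 0;
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &iters) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

Without the lock, two threads could read the same old value of the counter and both store the same new value, losing an increment; with the lock, the final total is always the number of threads times the number of iterations.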


The Pthreads Package 

To create and manage threads, you should use the pthreads package. Pthreads is short for POSIX 
threads; the threads created and managed by this package are also referred to as pthreads. 

The current Paragon OSF/1 implementation of pthreads is based on the POSIX Threads Extension 
[C language] P1003.4a/D4 (Draft 4), August 1990; however, it is not strictly conformant to this 
draft. Also, note that this is not the most recent draft of this extension, so Paragon OSF/1 pthreads 
programs may not be portable to or from other systems. Future versions of the Paragon OSF/1 
pthreads package may be based on later drafts of this extension, and may not be compatible with the 
current version. In particular, the treatment of thread cancellation and signals may be different in the 
future. 




The pthreads package consists of the following libraries: 

libpthreads.a    Contains thread management calls, such as pthread_create(). The calls in 
                 this library are discussed in “Using Pthreads Library Calls” on page 6-11. 

libc_r.a         Contains reentrant versions of standard C library (libc.a) calls, such as 
                 printf(). The calls in this library are discussed in “Using Reentrant C Library 
                 Calls” on page 6-6. 

The only programming interface to the pthreads package provided in the current release is for the C 
programming language. No Fortran interface is provided. 


NOTE 

Pthreads are not the same as, and are not compatible with, 
“cthreads” or “Mach threads.” 


These other types of thread are not supported for use in user programs. These types of thread are not 
compatible with libc_r.a and cannot be recognized or managed by the calls in libpthreads.a. 

What’s In This Chapter 

This chapter introduces the Paragon OSF/1 pthread package and its usage in parallel applications. It 
includes the following sections: 

• Limitations of pthreads. 

• Recommended safe operating environment. 

• Compiling and linking a pthread application. 

• Using reentrant C library calls. 

• Using pthreads library calls. 

• Interfacing with non-thread-safe code. 

• Message passing and pthreads library calls. 

• File I/O and pthreads library calls. 

• nx_nfork() and nx_initve() and pthreads library calls. 

• Signals and pthreads library calls. 

• Handling errors. 


Limitations of Pthreads 

The pthreads package has the following limitations in the current release: 

• Currently, none of the libraries supplied with Paragon OSF/1 (except for libpthreads.a and 
libc_r.a) are thread-safe. In particular, the library libnx.a, which contains all the calls discussed 
in the other chapters of this manual, is not thread-safe. In a process containing multiple pthreads, 
any calls to non-thread-safe libraries must be protected so that no two pthreads can call the same 
library at the same time. See “Interfacing with Non-Thread-Safe Code” on page 6-37 for 
information on how to do this. 

• Any global variables used or set by a non-thread-safe library may also have to be protected. For 
example, if a non-thread-safe function sets the global variable errno, you must be sure to read 
the value of errno before allowing any other pthread to make any call that could change the 
value of errno. See “errno Confusion” on page 6-41 for more information on errno. 

• In the current implementation, all the pthreads in a process always run on the same processor. 
Scheduling of pthreads is handled by the kernel, which uses a policy of time sharing with aging. 
You cannot control, or get information about, pthread scheduling by using pthread library calls. 

• Pthreads use kernel resources as well as user-level resources (to be specific, each pthread uses 
one kernel thread). This means that using very large numbers of pthreads can exhaust certain 
resources within the kernel. 

• The POSIX Threads Extension [C language] P1003.4a/D4 (Draft 4), August 1990 includes an 
optional feature called “thread priority scheduling.” This feature is not available in the current 
release. If you attempt to make use of this feature, you will get compilation errors (for use of an 
unsupported data type), link errors (for use of an unsupported library call), or run-time errors 
(for use of an unsupported system call). If a run-time error occurs, the call fails with the errno 
value ENOSYS. 

• The Paragon application development tools currently are not thread-aware and do not have any 
features to support pthreads. 

• There is no Fortran interface to the pthreads package. If you must use pthreads in a Fortran 
program, you could make the calls to the pthreads library from a C function, which can then be 
compiled to a .o file and linked into the Fortran program. However, this programming model 
has not been tested. 
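
The first two limitations in this list (serializing calls into a non-thread-safe library and reading errno before any other pthread can change it) can be combined in one wrapper. The sketch below uses open() merely as a stand-in for a non-thread-safe call, and the names lib_lock and guarded_open() are hypothetical. It is written with the current POSIX threads API; on many present-day systems errno is already per-thread, so the lock matters here mainly for libraries that keep their own global state.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>

/* One lock guards every call into the non-thread-safe library. */
static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

/* Call open() under the lock and capture errno before releasing it, so
 * that no other pthread can overwrite the value in between.
 * Returns the file descriptor, or -1 with *saved_errno set. */
int guarded_open(const char *path, int *saved_errno)
{
    int fd;

    pthread_mutex_lock(&lib_lock);
    fd = open(path, O_RDONLY);
    *saved_errno = (fd == -1) ? errno : 0;   /* read errno while still locked */
    pthread_mutex_unlock(&lib_lock);
    return fd;
}
```

Callers then test the saved value instead of reading errno directly, which keeps the error code tied to the call that produced it.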

Recommended Safe Operating Environment 

The previous section described the limitations which cannot be exceeded. This section recommends 
limitations which you should not exceed in the current release. Exceeding these limitations may 
result in unexpected behavior, up to and including system crashes and data loss. 

• No process should have more than 6 pthreads at once. 

Any pthread can create or terminate another pthread at any time. The system does not impose a 
limit on the number of active pthreads in a process, on a node, or in the whole system. However, 
the total active pthreads per process (including the main thread) should be kept at or below 6. 
Exceeding this limit may result in an emulator exception. 

• Only one pthread in a process should use the message-passing calls described in Chapter 3. The 
message-passing pthread can be the main thread or another pthread, but a pthread other than the 
main thread will experience higher message latency than the main thread. 

This limitation is due to the fact that the message-passing library ( libnx.a ) is not thread-safe. 
Also, there is no mechanism in current message passing calls to send or receive messages to or 
from a specific pthread within a process. See “Message Passing and Pthreads Library Calls” on 
page 6-37 for more information. 

If more than one pthread in a process attempts to perform message passing, message-passing 
performance may degrade, incorrect information may be returned from an info...() call, and 
global operations such as gsync() may give unexpected results. 

• All global operations (such as gsync()) must be performed by the message-passing pthread. This 
is necessary because all global operations use message-passing to synchronize the nodes. You 
can synchronize pthreads within a process by using a global variable counter as a barrier. 

• Only one pthread in a process should use the parallel file I/O calls described in Chapter 5. The 
I/O pthread can be the main thread or another pthread. See “File I/O and Pthreads Library Calls” 
on page 6-38 for more information. 

• The calls gopen() and setiomode() use message-passing internally. If the I/O pthread is not also 
the message-passing pthread, you must make sure that these calls are not used at the same time 
as any message-passing calls in the message-passing pthread. 

• The standard OSF/1 file I/O system calls, such as read() and write(), can be called from 
multiple pthreads at the same time if they are called from a controlling process and they are only 
used with files that reside in UFS file systems. Otherwise, only one pthread in a process can use 
them. 

• Applications with multiple pthreads should not be run in gang-scheduled partitions. Gang 
scheduling of threaded applications has not been thoroughly tested, but has been known to cause 
server exceptions. 

• Do not call sigwait() to wait on synchronous signals (those that are generated synchronously as 
the results of a pthread's faults, such as SIGBUS and SIGSEGV). Doing this may cause the 
application or the system to crash. See “Managing Signals” on page 6-34 for more information 
on sigwait(). 

• Do not use calls from libpthreads.a or libc_r.a within a signal handler. Some of these calls use 
mutexes internally, which may result in deadlock (the handler can find itself waiting on an 
unavailable mutex lock, while the mutex lock cannot be released until the signal handler has 
returned). 

• Asynchronous cancellation is very destructive and should be avoided. In particular, attempting 
to cancel a pthread doing file I/O on a PFS or UFS file system can cause the entire application 
to hang. See “Canceling Pthreads” on page 6-28 for more information on asynchronous 
cancellation. 

• Do not attempt to use the Paragon application development tools on applications with multiple 
pthreads. In particular, IPD currently is not thread-aware and should not be used to debug or 
profile an application that uses pthreads. 

• Do not terminate a process when other pthreads are progressing. For example, calling exit() or 
returning from main() kills all threads and terminates the entire process. If there are any other 
pthreads in the process, including pthreads generated transparently by library calls, null 
processes may result. Be sure to terminate all pthreads gracefully before terminating the 
program. In particular, be sure that all asynchronous and interrupt-driven message and I/O 
operations (such as hrecv() or iread()) are complete before the program terminates. It is also a 
good idea to ensure that main() always exits by calling pthread_exit(), never by calling exit() 
or by reaching the closing brace of main(). 

• No Mach calls can be used in a pthreads program. The Mach kernel interface ( libmach.a ) is not 
supported in the current release; use of Mach features in pthreads programs can cause pthreads 
internal errors or system crashes. 
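
The suggestion above to synchronize pthreads within a process by using a global variable counter as a barrier can be sketched as follows. This is a minimal illustration, not code from this guide. It assumes the current POSIX names (PTHREAD_MUTEX_INITIALIZER, sched_yield(), a NULL attributes pointer) rather than the Draft 4 calls described in this chapter, and the names NTHREADS, barrier_count, barrier_arrive_and_wait(), and run_barrier_demo() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define NTHREADS 4                    /* hypothetical; 4 workers + main stays under 6 */

static pthread_mutex_t barrier_lock = PTHREAD_MUTEX_INITIALIZER;
static int barrier_count = 0;         /* number of pthreads that have arrived */

/* Block the caller until all NTHREADS pthreads have arrived (one-shot). */
static void barrier_arrive_and_wait(void)
{
    int arrived;

    pthread_mutex_lock(&barrier_lock);
    barrier_count++;                  /* announce arrival */
    pthread_mutex_unlock(&barrier_lock);

    do {
        pthread_mutex_lock(&barrier_lock);
        arrived = barrier_count;      /* read the counter under the mutex */
        pthread_mutex_unlock(&barrier_lock);
        if (arrived < NTHREADS)
            sched_yield();            /* give up the processor while waiting */
    } while (arrived < NTHREADS);
}

static void *worker(void *arg)
{
    int *passed = arg;

    barrier_arrive_and_wait();
    *passed = 1;                      /* only reached after everyone arrived */
    return NULL;
}

/* Runs NTHREADS workers through the barrier; returns how many passed it. */
int run_barrier_demo(void)
{
    pthread_t tid[NTHREADS];
    int passed[NTHREADS] = {0};
    int i, total = 0;

    for (i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &passed[i]);
        assert(rc == 0);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += passed[i];
    }
    return total;
}
```

Because the counter is never reset, this barrier can be used only once. A reusable barrier would need an additional generation counter.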


Compiling and Linking a Pthread Application 

When compiling a program that uses the pthreads package, you should define the symbol 
_REENTRANT; this symbol ensures that thread-safe definitions are used in all included header 
files. (The compiler switch -M[no]reentrant does not have any effect on whether or not the 
resulting code is thread-safe. It only determines whether or not the code can be called recursively.) 

When linking a program that uses the pthreads package, you must link in the library libpthreads.a, 
followed by the library libc_r.a. The standard C library is also linked by default, but only after 
searching all libraries specified on the command line. If you use the -nx switch, it can appear on the 
command line either before these two libraries or after them; if you use the -lnx switch, it should 
appear after both these libraries. 

For example: 

• To compile and link a non-parallel program: 

% cc -D_REENTRANT -o node node.c -lpthreads -lc_r 

• To compile and link a controlling process: 

% cc -D_REENTRANT -o node node.c -lpthreads -lc_r -lnx 

• To compile and link a parallel application: 

% cc -D_REENTRANT -o node node.c -lpthreads -lc_r -nx 

Using Reentrant C Library Calls 

Only the calls in the reentrant C library (libc_r.a) and the calls in the pthreads library (libpthreads.a) 
are guaranteed to be thread-safe in the current release. Any calls to other libraries must be protected 
so that no two pthreads can call them at the same time. Table 6-1 lists the calls in libc_r.a. 



Table 6-1. Calls in Reentrant C Library (libc_r.a) (1 of 2) 

abort()             endgrent()          fstatfs()           getsockname()       memcpy()
abs()               endpwent()          fsync()             getsockopt()        memmove()
accept()            endttyent()         ftell()             gettimeofday()      memset()
access()            endusershell()      ftruncate()         gettimer()          mkdir()
acct()              endutent()          ftw()               getttyent_r()*      mknod()
adjtime()           exec_with_loader()  funlockfile()*      getttynam_r()*      mkstemp()
advance()           execlp()            fwrite()            getuid()            mktemp()
alarm()             execvp()            gcvt()              getusershell_r()*   mktime()
alloca()            exit()              getaddressconf()    getutent_r()*       mktimer()
asctime_r()*        fabs()              getc()              getutid_r()*        mmap()
async_daemon()      fchdir()            getchar()           getutline_r()*      modf()
atexit()            fchmod()            getclock()          getw()              mount()
atof()              fchown()            getcwd()            getwc()             mprotect()
atoi()              fclose()            getdirentries()     getwd()             msem_init()
atol()              fcntl()             getdtablesize()     gmtime()            msem_lock()
bcmp()              fcvt_r()*           getegid()           gmtime_r()*         msem_remove()
bcopy()             fdopen()            getenv()            htonl()             msem_unlock()
bind()              feof()              geteuid()           htons()             msgctl()
brk()               ferror()            getfh()             initstate()         msgget()
bzero()             fflush()            getfsent_r()*       initstate_r()*      msgrcv()
calloc()            ffs()               getfsfile_r()*      insque()            msgsnd()
catclose()          fgetc()             getfsspec_r()*      ioctl()             msync()
catgets()           fgets()             getfsstat()         isalnum()           munmap()
catopen()           fileno()            getgid()            isatty()            mvalid()
chdir()             flock()             getgrent_r()*       isdigit()           nfssvc()
chmod()             flockfile()*        getgrgid_r()*       isnan()             nice()
chown()             fopen()             getgrnam_r()*       isnand()            nl_langinfo()
chroot()            fpathconf()         getgroups()         isnanf()            ntohl()
clearerr()          fpgetmask()         gethostid()         isspace()           ntohs()
clock()             fpgetround()        gethostname()       isupper()           open()
close()             fpgetsticky()       getitimer()         isxdigit()          opendir()
closedir()          fprintf()           getlogin_r()*       kill()              pathconf()
connect()           fpsetmask()         getpagesize()       ldexp()             perror()
creat()             fpsetround()        getpeername()       link()              pipe()
ctermid()           fpsetsticky()       getpgrp()           listen()            plock()
ctime_r()*          fputc()             getpid()            localtime_r()*      poll()
cuserid()           fputs()             getppid()           longjmp()           printf()
dbm_close()         fread()             getpriority()       lseek()             profil()
dbm_fetch()         free()              getpwent_r()*       lstat()             ptrace()
dbm_open()          freopen()           getpwnam_r()*       madvise()           putc()
dup()               frexp()             getpwuid_r()*       malloc()            putchar()
dup2()              fscanf()            getrlimit()         memccpy()           puts()
ecvt_r()*           fseek()             getrusage()         memchr()            pututline_r()*
endfsent()          fstat()             gets()              memcmp()            putw()

* Does not exist in the standard C library (libc.a). 

Table 6-1. Calls in Reentrant C Library (libc_r.a) (2 of 2) 

putwc()          semctl()         setttyent()      strchr()         ulimit()
quotactl()       semget()         setuid()         strcmp()         umask()
raise()          semop()          setusershell()   strcpy()         umount()
rand()           send()           setutent()       strcspn()        uname()
rand_r()*        sendmsg()        setvbuf()        strdup()         ungetc()
random()         sendto()         shmat()          strerror_r()*    unlink()
random_r()*      setbuf()         shmctl()         strftime()       unlocked_fclose()*
re_comp_r()*     setbuffer()                       string()         unlocked_fflush()*
re_exec_r()*     setclock()                        strlen()         unlocked_fread()*
read()           setfsent()       shutdown()       strncat()        unlocked_fseek()*
readdir()        setgid()         sigaction()      strncmp()        unlocked_fwrite()*
readdir_r()*     setgrent()       sigaddset()      strncpy()        unlocked_getc()*
readlink()       setgroups()      sigdelset()      strpbrk()        unlocked_getchar()*
readv()          sethostid()      sigemptyset()    strrchr()        unlocked_getwc()*
realloc()        sethostname()    sigfillset()     strspn()         unlocked_putc()*
reboot()         setitimer()      sigismember()    strtok_r()*      unlocked_putchar()*
recv()           setjmp()         signal()         strtol()         unlocked_setvbuf()*
recvfrom()       setlinebuf()     sigprocmask()    strtoul()        utimes()
recvmsg()        setlocale()      sigreturn()      swapon()         utmpname()
reltimer()       setlogin()       sigstack()       symlink()        vfork()
remque()         setpgid()        sigsuspend()     sync()           vfprintf()
rename()         setpgrp()        sleep()          sysconf()        vprintf()
revoke()         setpriority()    socket()         table()          vsprintf()
rewind()         setpwent()       socketpair()     tempnam()        wait4()
rewinddir()      setregid()       sprintf()        time()           waitpid()
rindex()         setreuid()       srand()          times()          write()
rmdir()          setrlimit()      srandom()        tmpfile()        writev()
rmknod()         setsid()         srandom_r()*     tmpnam()
rmtimer()        setsockopt()     sscanf()         tolower()
sbrk()           setstate()       stat()           truncate()
scanf()          setstate_r()*    statfs()         ttyname_r()*
select()         settimeofday()   strcat()         tzset()

* Does not exist in the standard C library (libc.a). 

The calls in libc_r.a can be divided into three groups, according to their names: 

• Most of the calls in libc_r.a have the same names as calls in the standard C library (libc.a). 
These calls generally work the same as the equivalent calls in libc.a. However, they perform 
special checks and locks internally to be sure they will work if called by multiple pthreads at the 
same time. Also, many blocking system calls only block the calling pthread instead of every 
pthread of the process. The following commonly-used calls have the following effects in 
programs with multiple pthreads: 

exit()      Kills all pthreads of the process and closes all opened files. See “Calling 
            exit()” on page 6-42 for more information on using exit() in programs 
            with multiple threads. 

fork()      Copies only the calling pthread to the new process’s address space. If a 
            mutex lock is held by another pthread, then the calling pthread in the new 
            process may deadlock. For example, if a process has 2 pthreads and 
            pthread 0 calls fork() when pthread 1 is holding a mutex lock inside a call 
            to printf(), the only pthread in the new process will hang when it calls 
            printf(). 

exec()      Kills all pthreads other than the calling pthread, then loads a new 
            program into the process’s address space. The result is a new program 
            with one pthread. The calling pthread can create additional pthreads if it 
            wants. 

chdir()     Changes the current working directory for all pthreads in the calling 
            process. 

sleep()     Puts only the calling pthread to sleep. 

wait()      Blocks only the calling pthread; does not return until every pthread of the 
            process being waited for exits. Note that a pthread can wait() for a 
            process created by a different pthread. 

perror()    Uses the per-pthread errno (see “errno Confusion” on page 6-41). 

Note that these calls may have different semantics on different platforms. 

• libc_r.a also includes some calls whose names end in _r. These calls do the same thing as the 
similarly-named calls in the standard C library, but have different parameters. 

If the _r call is the only version of the call in libc_r.a, you must use the _r call to be sure 
your code is thread-safe. This is the case for most of the _r calls. 

If both an _r and a non-_r version of the call exist in libc_r.a, both versions are thread-safe, 
but the _r version offers better performance. This is the case for gmtime(), initstate(), 
rand(), random(), readdir(), setstate(), and srandom(). 

The _r calls are noted with an asterisk in Table 6-1. 

• Finally, the calls flockfile(), funlockfile(), and unlocked_...() exist only in libc_r.a: 

flockfile()       Locks the specified standard I/O stream for exclusive use by the calling 
                  pthread. 

funlockfile()     Unlocks the specified stream. 

unlocked_...()    Perform operations on a stream while it is locked (these are called 
                  “unlocked” calls because they do not perform any locking or unlocking 
                  of their own). 

The unlocked_...() calls are not thread-safe by themselves; they must be used together with 
flockfile()/funlockfile(). 

These calls offer better I/O performance and more control over I/O from pthreads than the 
standard thread-safe I/O calls. For example, the thread-safe version of putc() locks out all other 
I/O calls, writes the specified character, then unlocks. If you write a series of characters to a file 
with putc(), this locking and unlocking results in considerable overhead; also, there is nothing 
to prevent characters written by two different pthreads from becoming intermingled. 

You can instead use flockfile() to lock out all other operations on the file, a series of 
unlocked_putc() calls to write characters without locking and unlocking, and finally a 
funlockfile() to release the lock. In this case only one pair of lock/unlock operations is 
performed; your I/O performance will be better, and no other pthread’s output can interfere. See 
the OSF/1 Programmer’s Reference for more information on these calls. 
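
The lock, write, unlock sequence described above can be sketched as follows. This is an illustration under stated assumptions, not code from this guide: it uses putc_unlocked(), the name that current POSIX systems give to the operation this guide calls unlocked_putc(), and the function name write_line_locked() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L       /* exposes flockfile() and putc_unlocked() */
#include <assert.h>
#include <stdio.h>

/* Write text plus a newline to fp under a single stream lock, so that no
 * other pthread's output can be interleaved with it.
 * Returns the number of characters written, including the newline. */
int write_line_locked(FILE *fp, const char *text)
{
    int n = 0;

    assert(fp != NULL && text != NULL);

    flockfile(fp);                    /* one lock for the whole line */
    for (; *text != '\0'; text++, n++)
        putc_unlocked(*text, fp);     /* no per-character locking */
    putc_unlocked('\n', fp);
    funlockfile(fp);                  /* one unlock */

    return n + 1;
}
```

A pthread that calls write_line_locked(stdout, "Done") pays for one lock and one unlock instead of one pair per character.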

Using Pthreads Library Calls 

This section tells you how to use the calls in libpthreads.a to create and control pthreads in your 
programs. See the OSF/1 Programmer’s Reference for more detailed information on each call. 


Pthreads Library Data Types and Symbols 

In order to use any calls from the pthreads library, your program must include the file <pthread.h>, 
which defines several types and symbols used by this library. The most important of these are: 

pthread_t   Within each process, each active pthread is identified by a unique pthread ID, 
            which is a value of type pthread_t. You use a pthread’s ID to identify the 
            pthread in all calls that control pthreads. 

pthread_mutex_t and pthread_cond_t 

            Each active mutex is identified by a value of type pthread_mutex_t, and each 
            active condition variable is identified by a value of type pthread_cond_t. 

pthread_attr_t, pthread_mutexattr_t, and pthread_condattr_t 

            Objects of these types, called attributes objects, are used to specify the 
            attributes (characteristics) of pthreads, mutexes, and condition variables. 
            These types are extensible, and can support new features added by later 
            revisions of the pthreads standard while maintaining compatibility with 
            existing programs. Objects of these types are created with default values and 
            can be changed by pthreads library calls. 

pthread_attr_default, pthread_mutexattr_default, and pthread_condattr_default 

            These are external symbols whose values are the default attributes for an 
            object of the appropriate type. If you want to create an object with the default 
            attributes, you can use one of these symbols instead of creating a new 
            attributes object with the default attributes. 

pthread_key_t 
            This data type supports the per-pthread global data structure in the pthreads 
            library. This enables different functions to access global data that only 
            belongs to a single pthread. 
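
As a rough illustration of per-pthread global data, the following sketch stores a different value under the same key in each of two pthreads. It is not taken from this guide: it assumes the current POSIX calls pthread_key_create(), pthread_setspecific(), and pthread_getspecific(), whose names and signatures may differ from the Draft 4 library described here, and the names id_key, my_id(), and run_key_demo() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

static pthread_key_t id_key;          /* hypothetical key: one slot per pthread */

/* Any function can call this to read the calling pthread's own value. */
static intptr_t my_id(void)
{
    return (intptr_t)pthread_getspecific(id_key);
}

static void *worker(void *arg)
{
    pthread_setspecific(id_key, arg); /* visible only to this pthread */
    return (void *)(my_id() * 10);    /* exit status derived from own value */
}

/* Starts two pthreads with values 1 and 2; returns the sum of their results. */
int run_key_demo(void)
{
    pthread_t t1, t2;
    void *r1 = NULL, *r2 = NULL;
    int rc;

    rc = pthread_key_create(&id_key, NULL);   /* NULL: no destructor */
    assert(rc == 0);

    rc = pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    assert(rc == 0);
    rc = pthread_create(&t2, NULL, worker, (void *)(intptr_t)2);
    assert(rc == 0);

    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    pthread_key_delete(id_key);

    return (int)((intptr_t)r1 + (intptr_t)r2);
}
```

Each pthread sees only the value it stored itself, even though both use the same key variable.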

The Main Thread 

Each program initially has a single thread—the flow of control that starts at the beginning of the 
function main(). This thread is referred to as the main thread. 

Any other pthreads in the program are created by the main thread, either directly or indirectly. But 
threads do not have a parent-child relationship, as processes do, so the main thread does not have 
any special relationship with or control over other pthreads in the process. 

However, the C library treats the function main() specially, in a way that can affect other threads in 
the process: 


NOTE 

If the function main() returns (either by executing a return 
statement or by reaching the closing brace of the function), the C 
library generates an implicit call to _exit(), which kills all pthreads 
in the process and terminates the process. 


This means that you must either have main() wait for all other pthreads before returning, or make 
sure that main() always terminates itself by calling pthread_exit() rather than calling exit() or 
returning. 

Managing Pthread Execution 


Synopsis                                Description 

int pthread_create(                     Creates a pthread. 
    pthread_t *thread, 
    pthread_attr_t attr, 
    void *(*routine)(void *arg), 
    void *arg ); 

pthread_t pthread_self(void);           Returns the ID of the calling pthread. 

int pthread_equal(                      Compares two pthread identifiers. 
    pthread_t thread1, 
    pthread_t thread2 ); 

void pthread_yield(void);               Allows the scheduler to run another pthread 
                                        instead of the current one. 

void pthread_exit(                      Terminates the calling pthread. 
    void *status ); 

int pthread_join(                       Waits for a pthread to terminate. 
    pthread_t thread, 
    void **status ); 

int pthread_detach(                     Detaches a pthread. 
    pthread_t *thread ); 


To create a pthread, call pthread_create(). This call has the following parameters: 

thread      Pointer to a variable of type pthread_t that receives the pthread ID of the 
            newly-created pthread. 

attr        An object of type pthread_attr_t that describes the desired attributes of the 
            new pthread. This can be the default pthread attribute object 
            pthread_attr_default, or a user-created pthread attribute object (see 
            “Managing Pthread Attributes” on page 6-15). 

routine     Pointer to the initial function to be executed by the new pthread. This function 
            is assumed to return void * and to have one argument of type void *. 

arg         Value of type void * to be passed to the initial function as its argument. 


A pthread can determine its own pthread ID by calling pthread_self(), and a pthread ID can be 
compared against another pthread ID by calling pthread_equal(). Note that pthread_t is an 
“opaque” type, and you should not use standard C operators on it. 
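
The following sketch shows one way to compare pthread IDs without applying C operators to them. It is an illustration only: it assumes the current POSIX form of pthread_create(), in which the attributes argument is a pointer and NULL selects the defaults (this guide's Draft 4 form passes pthread_attr_default instead), and the names main_id, is_main_thread(), and run_equal_demo() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_t main_id;             /* ID of the pthread that started the demo */

/* Returns nonzero if the calling pthread is the one recorded in main_id. */
static int is_main_thread(void)
{
    /* pthread_t is opaque, so compare with pthread_equal(), not with == */
    return pthread_equal(pthread_self(), main_id);
}

static void *worker(void *arg)
{
    int *result = arg;

    *result = is_main_thread();       /* a new pthread has a different ID */
    return NULL;
}

/* Returns 1 if the IDs compare as expected, 0 otherwise. */
int run_equal_demo(void)
{
    pthread_t tid;
    int in_worker = -1;
    int rc;

    main_id = pthread_self();
    rc = pthread_create(&tid, NULL, worker, &in_worker);
    assert(rc == 0);
    pthread_join(tid, NULL);

    return is_main_thread() != 0 && in_worker == 0;
}
```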

The pthread_yield() call will decrease the priority of the calling pthread and give up the node’s 
processor to other pthreads that have higher priorities than the calling pthread. The kernel decides 
which thread to run next, based on its time sharing and aging policies. Eventually, the calling pthread 
will be scheduled to run again when other pthreads become lower priority pthreads. A pthread should 
call pthread_yield() to give up the processor when it is making no progress or has no work to do. 

A pthread terminates when it calls pthread_exit() or returns from its initial function. However, the 
termination of a pthread does not release all the resources associated with the pthread. To release a 
terminated pthread's resources, a different pthread must call pthread_join() or pthread_detach(): 

• pthread_join() blocks until the specified pthread terminates, then releases the specified 
pthread’s resources and returns the exit status of the specified pthread to its caller. The exit 
status is the value specified in the pthread’s pthread_exit() call, or the return value of its initial 
function if it did not call pthread_exit(). 

• pthread_detach() tells the pthreads library to release the specified pthread’s resources, then 
returns immediately to its caller. Later, when the specified pthread terminates, the library 
releases the pthread’s resources and discards the pthread’s exit status. 

Any pthread that creates other pthreads should call pthread_join() or pthread_detach() for each 
pthread it created before it terminates itself. 
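
The two ways a pthread can supply its exit status, and the way pthread_join() returns that status, can be sketched as follows. This is a minimal example under stated assumptions: it uses the current POSIX pthread_create() (NULL for default attributes), and the names by_return(), by_exit(), run_and_join(), join_demo_return(), and join_demo_exit() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Terminates by returning: the return value becomes the exit status. */
static void *by_return(void *arg)
{
    return arg;
}

/* Terminates by calling pthread_exit() with an explicit status. */
static void *by_exit(void *arg)
{
    pthread_exit(arg);
    return NULL;                      /* not reached */
}

/* Creates one pthread, waits for it, and returns its exit status. */
static long run_and_join(void *(*routine)(void *), long status)
{
    pthread_t tid;
    void *result = NULL;
    int rc;

    rc = pthread_create(&tid, NULL, routine, (void *)(intptr_t)status);
    assert(rc == 0);
    rc = pthread_join(tid, &result);  /* blocks, then releases the resources */
    assert(rc == 0);

    return (long)(intptr_t)result;
}

long join_demo_return(void) { return run_and_join(by_return, 7); }
long join_demo_exit(void)   { return run_and_join(by_exit, 9); }
```

If the creator has no use for the exit status, it can call pthread_detach() instead of pthread_join(). In current POSIX that call takes the pthread ID by value, whereas the Draft 4 form shown in this guide takes a pointer to it.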

Managing Pthread Attributes 


Synopsis                                Description 

int pthread_attr_create(                Creates a pthread attributes object. 
    pthread_attr_t *attr ); 

int pthread_attr_setstacksize(          Sets the value of the stack size attribute of a 
    pthread_attr_t *attr,               pthread attributes object. 
    long stacksize ); 

int pthread_attr_delete(                Deletes a pthread attributes object. 
    pthread_attr_t *attr ); 

int pthread_attr_getstacksize(          Returns the value of the stack size attribute of a 
    pthread_attr_t attr );              pthread attributes object. 


The only pthread attribute that is currently modifiable is stack size. (A pthread’s priority and 
scheduling policy are managed by the kernel and cannot be inspected or changed.) To set a pthread’s 
stack size, use the following procedure: 

1. Call pthread_attr_create() to create a pthread attributes object (an object of type 
pthread_attr_t). 

2. Call pthread_attr_setstacksize() to set the stack size in that object. 

3. Use the modified pthread attributes object in the call to pthread_create() that creates the 
pthread. 

4. Call pthread_attr_delete() to remove the pthread attributes object. 

Once a pthread has been created, the size of its stack is fixed and can’t be changed. 

To use the default stack size, you can simply use the default pthread attribute object 
pthread_attr_default instead of creating your own pthread attributes object. 

You can use pthread_attr_getstacksize() to find out the current stack size in a pthread attributes 
object. 
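
The four-step procedure above can be sketched as follows. Note the assumptions: this sketch uses the current POSIX names pthread_attr_init() and pthread_attr_destroy() in place of the Draft 4 calls pthread_attr_create() and pthread_attr_delete() documented here, the current pthread_attr_getstacksize() returns the size through a pointer argument, and WORKER_STACK and run_stacksize_demo() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

#define WORKER_STACK (256 * 1024)     /* hypothetical stack size, in bytes */

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Creates a pthread with a non-default stack size. Returns the size stored
 * in the attributes object, or 0 if any step fails. */
size_t run_stacksize_demo(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t size = 0;
    int rc;

    if (pthread_attr_init(&attr) != 0)                 /* step 1: create */
        return 0;
    if (pthread_attr_setstacksize(&attr, WORKER_STACK) != 0) {  /* step 2 */
        pthread_attr_destroy(&attr);
        return 0;
    }
    rc = pthread_attr_getstacksize(&attr, &size);      /* read the size back */
    assert(rc == 0);

    if (pthread_create(&tid, &attr, worker, NULL) != 0) /* step 3: use it */
        size = 0;
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);                       /* step 4: remove */
    return size;
}
```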



NOTE 

Whenever possible, use the same stack size for all pthreads. 


Each pthread is built on a lower-level construct called a kernel thread. When you create a pthread, 
the pthread library tries to re-use a kernel thread from a pool of existing kernel threads. This means 
that creating a new pthread is more expensive if there are no existing kernel threads inside the 
pthreads library that can be reused. A major cause for the kernel being unable to recycle kernel 
threads is using a different stack size for new pthreads; this should be avoided. 


Managing Mutexes 


Synopsis                                Description 

int pthread_mutex_init(                 Creates a mutex. 
    pthread_mutex_t *mutex, 
    pthread_mutexattr_t attr ); 

int pthread_mutex_lock(                 Locks a mutex. 
    pthread_mutex_t *mutex ); 

int pthread_mutex_trylock(              Tries once to lock a mutex. 
    pthread_mutex_t *mutex ); 

int pthread_mutex_unlock(               Unlocks a mutex. 
    pthread_mutex_t *mutex ); 

int pthread_mutex_destroy(              Deletes a mutex. 
    pthread_mutex_t *mutex ); 


A pthread mutex is a binary semaphore with two states: locked and unlocked. When a mutex is 
created, its initial state is unlocked. Only one pthread at a time can lock a mutex. When a pthread 
successfully locks a mutex, it becomes the mutex’s owner. Any other pthread that attempts to lock 
the mutex will block until the owner unlocks the mutex. Mutexes cannot be used recursively: if the 
owner attempts to lock the mutex again, the attempt fails. 

You should use mutex locks to serialize pthread access to a block of code that accesses a 
nonshareable resource, such as a file or a non-thread-safe library. A pthread that is waiting on a 
mutex lock will not use any of the node’s processor time. 

To create and initialize a mutex, call pthread_mutex_init(). This call creates a new mutex with the 
attributes specified by attr (typically the default mutex attributes object pthread_mutexattr_default ) 
and stores the new mutex’s ID into the variable pointed to by mutex. A newly-created mutex is 
unlocked. 

To lock a mutex, call pthread_mutex_lock(). The call to pthread_mutex_lock() will block the 
calling pthread until the mutex lock is available. A pthread waiting on a mutex lock will be scheduled 
out and another pthread will be scheduled to run. When the calling pthread is again scheduled to run 
because no higher-priority pthread can run, it checks the availability of the mutex lock again and is 
scheduled out again if the mutex lock is still unavailable. 

Note that there is no guarantee that a pthread waiting on a pthread_mutex_lock() will eventually 
get the lock. If you do not want to block until the lock is available, call pthread_mutex_trylock(). 
This call tries once to lock the specified mutex. If the attempt succeeds, the call returns 1 
immediately; but if the mutex is already locked, the call returns 0 immediately. 

When a pthread is finished using the resource controlled by the mutex, it should release the lock by 
calling pthread_mutex_unlock(). This allows any other pthread that has been waiting to lock the 
mutex to proceed. 

When all pthreads have finished using the mutex, you should remove it and release all resources 
associated with it by calling pthread_mutex_destroy(). You cannot destroy a mutex that is 
currently locked. Attempting to lock or unlock a mutex that has been successfully destroyed will 
result in undefined behavior. 
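
The life cycle of a mutex described in this section (create, lock, unlock, destroy) can be sketched as follows. This is an illustration under stated assumptions, not code from this guide: it uses the current POSIX forms, in which pthread_mutex_init() takes a pointer to the attributes (NULL for the defaults) and pthread_mutex_trylock() returns 0 on success rather than the 1 documented here for Draft 4. The names counter, NWORKERS, NITER, and run_mutex_demo() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NWORKERS 4                    /* hypothetical worker count */
#define NITER    10000                /* increments per worker */

static pthread_mutex_t lock;
static long counter = 0;              /* the shared resource the mutex protects */

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);    /* blocks until the lock is available */
        counter++;                    /* only one pthread at a time gets here */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Returns the final counter value, or -1 if the mutex cannot be set up. */
long run_mutex_demo(void)
{
    pthread_t tid[NWORKERS];
    int i, rc;

    if (pthread_mutex_init(&lock, NULL) != 0)
        return -1;

    /* trylock never blocks; a new mutex is unlocked, so this must succeed */
    if (pthread_mutex_trylock(&lock) != 0)
        return -1;
    pthread_mutex_unlock(&lock);

    counter = 0;
    for (i = 0; i < NWORKERS; i++) {
        rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&lock);     /* no pthread holds or needs it now */
    return counter;
}
```

The same lock and unlock pair can be placed around calls into a library that is not thread-safe, so that no two pthreads are ever inside that library at once.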


Managing Mutex Attributes 


Synopsis                                Description 

int pthread_mutexattr_create(           Creates a mutex attributes object. 
    pthread_mutexattr_t *attr ); 

int pthread_mutexattr_delete(           Deletes a mutex attributes object. 
    pthread_mutexattr_t *attr ); 


No mutex attributes are currently defined. You can either use the default mutex attributes object, 
pthread_mutexattr_default, or create a mutex attributes object for use in pthread_mutex_init() by 
calling pthread_mutexattr_create(). A user-created mutex attributes object should be released by 
calling pthread_mutexattr_delete(). 

An Example Pthreads Program 

The following program demonstrates some principles of using pthreads and mutexes. It creates a 
user-specified number of pthreads, each of which prints its node number, ptype, and pthread ID and 
the message “Done.” 

#include <stdio.h> 
#include <pthread.h> 
#include <stdlib.h> 
#include <nx.h> 

#define MAXTHREAD 6                     /* thread maximum limit */ 

/* pthread resources */ 
pthread_t       thread[MAXTHREAD];      /* per-thread pthread ID */ 
pthread_mutex_t mutex;                  /* mutex to protect global 
                                           variable "thread_alive" */ 

/* global variables that only the main thread writes to */ 
int max_thread = MAXTHREAD;             /* maximum thread number */ 
int my_node;                            /* my node number */ 
int my_ptype;                           /* my ptype */ 

/* shared global variable that is modified by all threads */ 
int thread_alive;                       /* count of living threads */ 

/* forward declarations */ 
void thread_fun(int thread_id);         /* initial function for 
                                           new threads */ 

main(int argc, char *argv[]) 
{ 
    int index;                          /* loop index */ 
    int my_thread = 0;                  /* main thread is indexed 0 */ 

    my_node = mynode(); 
    my_ptype = myptype(); 

    if(argc != 2) { 
        if(my_node == 0) { 
            printf("Usage: %s <nthreads>\n", argv[0]); 
        } 
        exit(1); 
    } 

    max_thread = atoi(argv[1]); 
    if(max_thread > MAXTHREAD) { 
        if(my_node == 0) { 
            printf("Error: %d threads requested, must be %d or less\n", 
                max_thread, MAXTHREAD); 
        } 
        exit(1); 
    } 

    /* The main thread is the last thread alive, so don't count itself. */ 
    thread_alive = max_thread - 1; 

    /* create and initialize a mutex to control access to "thread_alive" */ 
    if(pthread_mutex_init(&mutex, pthread_mutexattr_default) == -1) { 
        perror("pthread_mutex_init Error"); 
    } 

    /* 
     * Spawn threads and remember each thread's pthread ID. 
     * The main thread is indexed as thread 0. 
     */ 
    thread[my_thread] = pthread_self(); 
    for(index = 1; index < max_thread; index++) { 
        if(pthread_create(&thread[index], pthread_attr_default, 
                (void *)thread_fun, (void *)index) == -1) { 
            perror("pthread_create Error"); 
            exit(2); 
        } 
    } 

    /* loop until all other threads are finished. */ 
    while(thread_alive != 0) { 
        pthread_yield(); 
    } 

    /* 
     * Ignore other threads' exit status (can also be done right after 
     * pthread_create()). 
     */ 
    for(index = 1; index < max_thread; index++) { 
        pthread_detach(&thread[index]); 
    } 

    printf("( %3d, %3d, %3d) Done\n", my_node, my_ptype, my_thread); 
} 

/*********************************************************************** 
 * thread_fun()   This is the initial function for new threads 
 ***********************************************************************/ 
void thread_fun(int my_thread) 
{ 
    printf("( %3d, %3d, %3d) Done\n", my_node, my_ptype, my_thread); 

    /* 
     * use Mutex to protect global variable "thread_alive" 
     */ 
    if(pthread_mutex_lock(&mutex) == -1) { 
        perror("pthread_mutex_lock Error"); 
    } 

    thread_alive--; 

    if(pthread_mutex_unlock(&mutex) == -1) { 
        perror("pthread_mutex_unlock Error"); 
    } 

    /* terminate (status is ignored) */ 
    pthread_exit(NULL); 
} 


Assuming this program is called pthreads.c , use the following command to compile it: 

% cc -D_REENTRANT -o pthreads pthreads.c -lpthreads -lc_r -nx 

To run this program with three pthreads per process on two nodes of your default partition, use the 
following command: 

% pthreads 3 -sz 2 

The results of this application run: 


(   0,   0,   2) Done 
(   1,   0,   1) Done 
(   0,   0,   1) Done 
(   0,   0,   0) Done 
(   1,   0,   2) Done 
(   1,   0,   0) Done 


Note that the results may appear in a different order on each run. 

Using Condition Variables to Synchronize Pthreads 


Synopsis                                Description 

int pthread_cond_init(                  Creates a condition variable. 
    pthread_cond_t *cond, 
    pthread_condattr_t attr ); 

int pthread_cond_wait(                  Waits on a condition variable. 
    pthread_cond_t *cond, 
    pthread_mutex_t *mutex ); 

int pthread_cond_timedwait(             Waits on a condition variable for a specified 
    pthread_cond_t *cond,               period of time. 
    pthread_mutex_t *mutex, 
    struct timespec *abstime ); 

int pthread_cond_signal(                Wakes up a pthread that is waiting on a condition 
    pthread_cond_t *cond );             variable. 

int pthread_cond_broadcast(             Wakes up all pthreads that are waiting on a 
    pthread_cond_t *cond );             condition variable. 

int pthread_cond_destroy(               Destroys a condition variable. 
    pthread_cond_t *cond ); 


The pthreads package provides condition variables to help you synchronize the execution of 
pthreads. You can also synchronize pthreads by looping on the value of a variable, as shown in the 
previous example, but using condition variables is more efficient. Condition variables use the 
following objects: 

• A mutex (an object of type pthread_mutex_t, as discussed under “Managing Mutexes” on page 
6-16) that is used to protect the pthread condition variable and the global predicate variable. 

• A condition variable (an object of type pthread_cond_t) that links all pthreads waiting on a 
particular condition. 

• A global predicate variable that indicates the current state of the condition. It could be a global 
integer variable. 

To create and initialize a condition variable, call pthread_cond_init(). This call creates a new 
condition variable with the attributes specified by attr (typically the default condition attributes 
object pthread_condattr_default) and stores the new condition variable’s ID into the variable 
pointed to by cond. The list of pthreads that are waiting on the new condition variable is initially 
empty. 

To wait for a condition, call pthread_cond_wait(). This call unlocks the specified mutex and blocks 
until the specified condition is signaled by another pthread. When the condition is signaled, the call 
re-locks the mutex and returns to the caller. pthread_cond_timedwait() is similar, but if the 
specified amount of time passes before the condition is signaled, the call re-locks the mutex and 
returns an error condition. You must successfully lock the specified mutex before calling 
pthread_cond_wait() or pthread_cond_timedwait(), and you should unlock the mutex after the 
call to pthread_cond_wait() or pthread_cond_timedwait() returns. 

To signal a condition, call pthread_cond_signal() or pthread_cond_broadcast(). 
pthread_cond_signal() signals the specified condition to one of the pthreads that is waiting for it 
(if more than one pthread is waiting for the condition to be signaled, the kernel selects one of them 
arbitrarily). pthread_cond_broadcast() signals the specified condition to all of the pthreads that are 
waiting for it. If no other pthread is waiting on the condition, these calls have no effect. No mutex 
lock is required for these calls (but a mutex can be used to prevent certain race conditions; see the 
example on page 6-25). 

If a pthread calls pthread_cond_wait() after the specified condition has been signaled, that pthread 
could wait forever. To prevent this problem, use a global predicate variable. This can be any 
variable that is visible to all pthreads. You use it as follows: 

• Before calling pthread_cond_signal() or pthread_cond_broadcast(), a pthread should set the 
value of the condition’s global predicate value to indicate that the condition has occurred. The 
global predicate value should be protected by a mutex if there is any possibility that more than 
one pthread could try to set it at once. 

• Before calling pthread_cond_wait() or pthread_cond_timedwait(), a pthread should check 
the current value of the condition’s global predicate variable. If the condition has already 
occurred, the pthread should proceed without calling pthread_cond_wait() or 
pthread_cond_timedwait(). 

• After a successful call to pthread_cond_wait() or pthread_cond_timedwait(), a pthread 
should check the global predicate value to be sure it has the expected value. If the global 
predicate variable does not indicate that the condition has occurred, the pthread should call 
pthread_cond_wait() or pthread_cond_timedwait() again. 

The first example shown under “Examples of Condition Variables” on page 6-24 gives an example 
of this technique. 
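
The predicate technique can also be sketched as follows. This is only a sketch, and it assumes the standardized POSIX pthreads interface rather than the calls described in this chapter: the POSIX calls return an error number instead of -1, and a statically allocated mutex and condition variable are set up with the PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER macros. The names ready, seen, waiter(), and run_demo() are invented for the illustration.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names; POSIX pthreads API assumed, not the draft API above. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int ready = 0;   /* global predicate, protected by mutex */
static int seen  = 0;   /* copy of the predicate taken by the waiter */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mutex);
    /* Check the predicate before waiting and again after every wakeup. */
    while (!ready) {
        int rc = pthread_cond_wait(&cond, &mutex);
        if (rc != 0) {
            fprintf(stderr, "pthread_cond_wait: %s\n", strerror(rc));
            break;
        }
    }
    seen = ready;
    pthread_mutex_unlock(&mutex);
    return NULL;
}

/* Returns 1 if the waiter observed the condition, 0 otherwise. */
int run_demo(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, waiter, NULL) != 0)
        return 0;

    pthread_mutex_lock(&mutex);
    ready = 1;                      /* set the predicate first ... */
    pthread_cond_signal(&cond);     /* ... then signal the condition */
    pthread_mutex_unlock(&mutex);

    pthread_join(tid, NULL);
    return seen;
}
```

Because the waiter tests ready before it waits, the sketch behaves the same whether the signal is sent before or after the waiter reaches pthread_cond_wait().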

When all pthreads have finished using a condition variable, you should remove it and release all 
resources associated with it by calling pthread_cond_destroy(). You cannot destroy a condition 
variable that is currently being waited on. 

Managing Condition Attributes 


Synopsis                                Description

int pthread_condattr_create(            Creates a condition variable attributes object.
    pthread_condattr_t *attr );

int pthread_condattr_delete(            Deletes a condition variable attributes object.
    pthread_condattr_t *attr );



No condition attributes are currently defined. You can either use the default condition attributes 
object, pthread_condattr_default, or create a condition attributes object for use in 
pthread_cond_init() by calling pthread_condattr_create(). A user-created condition attributes 
object should be released by calling pthread_condattr_delete(). 

Examples of Condition Variables 

The following example uses a mutex to protect the global predicate variable cond_true, which is 
used to prevent the signaling pthread from calling pthread_cond_signal() until the waiting pthread 
has called pthread_cond_wait(). Note that the call to pthread_cond_signal() is within the mutex 
lock; this is not necessary, but can prevent certain race conditions. 

The pthread waiting for the condition executes the following code: 

if (pthread_mutex_lock(&mutex) == -1) {
    perror("pthread_mutex_lock Error");
}

/*
 * If the expected condition already exists, don't call
 * pthread_cond_wait(), since the condition signal will not be
 * sent if no threads are waiting for this condition.
 *
 * Recheck the state of cond_true after calling
 * pthread_cond_wait() to ensure that the received condition
 * signal is for this expected state change.
 */
while (!cond_true) {
    /*
     * mutex will be unlocked in pthread_cond_wait() when
     * calling thread is ready to wait.
     */
    if (pthread_cond_wait(&cond, &mutex) == -1) {
        perror("pthread_cond_wait Error");
        break;
    }
    /*
     * mutex will be locked again when the calling thread
     * is awakened.
     */
}

if (pthread_mutex_unlock(&mutex) == -1) {
    perror("pthread_mutex_unlock Error");
}

The pthread signaling the condition executes the following code: 

if (pthread_mutex_lock(&mutex) == -1) {
    perror("pthread_mutex_lock Error");
}

/* This global variable needs a mutex's protection. */
++cond_true;

/*
 * The pthread_cond_signal() call does not use a mutex internally.
 * The mutex protection will guarantee that every thread can
 * catch the expected condition signal once it calls
 * pthread_cond_wait(). This will prevent the endless block of
 * the calling thread.
 */
if (pthread_cond_signal(&cond) == -1) {
    perror("pthread_cond_signal Error");
}

if (pthread_mutex_unlock(&mutex) == -1) {
    perror("pthread_mutex_unlock Error");
}

Here’s another example, which uses pthread_cond_broadcast() to wake up all waiting pthreads. 

/*
 * Simulate a thread-level gsync() to synchronize all active
 * threads at a barrier. A counter in this example can only be
 * used once.
 */

long            cond_gsync;   /* counter of threads arrived */
long            max_thread;   /* number of active threads */
pthread_mutex_t mutex;
pthread_cond_t  cond;

void thread_gsync(long *cond_gsync)
{
    if (pthread_mutex_lock(&mutex) == -1) {
        perror("pthread_mutex_lock Error");
    }

    /*
     * Increase the count of threads that have called thread_gsync().
     * The parentheses are required: *cond_gsync++ would advance the
     * pointer instead of the counter.
     */
    (*cond_gsync)++;

    if (*cond_gsync == max_thread) {
        /*
         * If I'm the last thread to call thread_gsync(),
         * wake up all threads waiting on this condition.
         */
        if (pthread_cond_broadcast(&cond) == -1) {
            perror("pthread_cond_broadcast Error");
        }
    } else {
        while (*cond_gsync != max_thread) {
            /*
             * Other threads haven't called thread_gsync() yet,
             * so wait for them in pthread_cond_wait().
             */
            if (pthread_cond_wait(&cond, &mutex) == -1) {
                perror("pthread_cond_wait Error");
                break;
            }
        }
    }

    if (pthread_mutex_unlock(&mutex) == -1) {
        perror("pthread_mutex_unlock Error");
    }
}

int main(int argc, char *argv[])
{
    /*
     * Initialize counter and the condition that counter
     * will meet.
     */
    max_thread = atoi(argv[1]);   /* threads to create */
    cond_gsync = 0;               /* init counter */

    if (pthread_cond_init(&cond,
                          pthread_condattr_default) == -1) {
        perror("pthread_cond_init Error");
    }

    if (pthread_mutex_init(&mutex,
                           pthread_mutexattr_default) == -1) {
        perror("pthread_mutex_init Error");
    }

    /* Create more pthreads */

}

/* Every thread calls thread_gsync() once */
thread_gsync(&cond_gsync);

/* Every thread passes this barrier at the same time */

Here’s an example using pthread_cond_timedwait(): 

#include <errno.h>
#include <sys/timers.h>

long interval = 10;   /* 10 seconds interval */
struct timespec abs_time;

/* Get the current time */
getclock(TIMEOFDAY, &abs_time);

/*
 * Can use another member of structure timespec to specify the
 * waiting interval in nanoseconds. But the resolution cannot be
 * smaller than the interval between updates of the system clock.
 *
 * The wait time should not be so small that the
 * absolute time specified is smaller than the
 * time spent inside the pthread_cond_timedwait() call.
 */
abs_time.tv_sec = abs_time.tv_sec + interval;

if (pthread_mutex_lock(&mutex) == -1) {
    perror("pthread_mutex_lock Error");
}

while (!condition) {
    if (pthread_cond_timedwait(&cond, &mutex, &abs_time) == -1) {
        /* EAGAIN is the timeout error code. */
        if (errno != EAGAIN) {
            perror("pthread_cond_timedwait Error");
        }
        break;
    }
}

if (pthread_mutex_unlock(&mutex) == -1) {
    perror("pthread_mutex_unlock Error");
}
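
Here’s a comparable sketch of a timed wait written against the standardized POSIX interface. That interface is an assumption; it is not the one documented in this chapter. In POSIX, the current time is read with clock_gettime(), and pthread_cond_timedwait() returns the error number ETIMEDOUT on a timeout instead of returning -1 with errno set to EAGAIN. The function name wait_with_timeout() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
static int condition = 0;   /* predicate; nothing sets it in this sketch */

/*
 * Wait up to `seconds` for `condition` to become true.
 * Returns 0 if the condition occurred, ETIMEDOUT on timeout,
 * or another error number on failure.
 */
int wait_with_timeout(long seconds)
{
    struct timespec abs_time;
    int rc;

    /* POSIX counterpart (assumed) of getclock(TIMEOFDAY, ...). */
    if (clock_gettime(CLOCK_REALTIME, &abs_time) != 0)
        return errno;
    abs_time.tv_sec += seconds;     /* absolute deadline, not an interval */

    rc = pthread_mutex_lock(&mutex);
    if (rc != 0)
        return rc;

    while (!condition && rc == 0)
        rc = pthread_cond_timedwait(&cond, &mutex, &abs_time);
    if (condition)
        rc = 0;                     /* the predicate wins over a late timeout */

    pthread_mutex_unlock(&mutex);
    return rc;
}
```

The deadline is computed once, before the loop, so repeated wakeups do not extend the total waiting time.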

Canceling Pthreads 

Synopsis                                Description

int pthread_cancel(                     Requests cancellation of a pthread.
    pthread_t thread );

int pthread_setcancel(                  Enables or disables the general cancelability of
    int state );                        the calling pthread.

int pthread_setasynccancel(             Enables or disables the asynchronous
    int state );                        cancelability of the calling pthread.

void pthread_testcancel(void);          Creates a cancellation point in the calling pthread.


The pthreads package includes a pthread cancellation mechanism that allows a pthread to terminate 
the execution of other pthreads. The call pthread_cancel() requests cancellation of the specified 
pthread; however, the specified pthread may terminate later or not at all, depending on its 
cancelability states. 


Cancelability States 

Each pthread has two cancelability states that determine how it reacts to cancellation requests. Each 
of the two states can be set to the value CANCEL_ON (enabled) or CANCEL_OFF (disabled). 

• General cancelability determines whether or not the pthread can be canceled: 

    If general cancelability is enabled, cancellation requests are accepted. Cancellation may or 
    may not occur immediately, depending on the asynchronous cancelability state. 

    If general cancelability is disabled, cancellation requests are queued until general 
    cancelability is enabled again. 

  General cancelability is enabled by default; a pthread can change its general cancelability state 
  by calling pthread_setcancel(). 

• When general cancelability is enabled, a second cancelability state called asynchronous 
  cancelability determines how quickly the cancellation occurs: 

    If asynchronous cancelability is enabled, when a cancellation request is received the 
    pthread begins termination immediately. 

    If asynchronous cancelability is disabled, when a cancellation request is received the 
    pthread does not begin termination until it reaches a cancellation point. The default 
    cancellation points are calls to pthread_cond_wait(), pthread_cond_timedwait(), 
    pthread_join(), and pthread_setcancel(CANCEL_ON). A pthread can also create an 
    explicit cancellation point by calling pthread_testcancel(), which otherwise does nothing. 

  Asynchronous cancelability is disabled by default; a pthread can change its asynchronous 
  cancelability state by calling pthread_setasynccancel(). 


NOTE 

Asynchronous cancelability should not be enabled in the current 
release. 


Asynchronous cancellation of certain pthreads, particularly pthreads performing file I/O, can cause 
the entire application to hang. 


NOTE 

You must be careful not to cancel a pthread that is holding a mutex 
lock. 


Canceling a pthread that is holding a mutex lock leaves the mutex locked with no way to unlock it, 
possibly resulting in deadlock. For example, a pthread calling printf() will get a mutex lock inside 
the reentrant C library. A cancellation of this pthread during the call to printf() will cause all other 
pthreads calling printf() to deadlock. 

Functions such as printf(), which can cause deadlock if they are canceled, are called not safe to 
cancel. 

NOTE 

Most library functions are not safe to cancel. 

In particular, none of the calls in libnx.a is safe to cancel. The list of functions that are safe to cancel 
can be found in the pthread_setasynccancel() manpage in the OSF/1 Programmer’s Reference. 


Cancellation Examples 

Here’s an example of changing a pthread’s cancelability states: 

/* flip the general cancelability of the calling thread */
if (pthread_setcancel(CANCEL_ON) == -1) {
    perror("pthread_setcancel Error");
}

if (pthread_setcancel(CANCEL_OFF) == -1) {
    perror("pthread_setcancel Error");
}

/* flip the asynchronous cancelability of the calling thread */
if (pthread_setasynccancel(CANCEL_ON) == -1) {
    perror("pthread_setasynccancel Error");
}

if (pthread_setasynccancel(CANCEL_OFF) == -1) {
    perror("pthread_setasynccancel Error");
}

Here’s an example of delivering and accepting cancellations: 

pthread_t thread_id;   /* value from pthread_create() call */

/*
 * Cancel another thread whose pthread ID is "thread_id".
 */
if (pthread_cancel(thread_id) == -1) {
    perror("pthread_cancel Error");
}

/*
 * If a cancellation request is already posted, this call will
 * not return.
 */
pthread_testcancel();

/* Execution continues if no posted cancellation request */
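
The two fragments can be combined into one self-contained sketch. It assumes the standardized POSIX interface, which differs from the calls documented here: pthread_setcancelstate() takes the place of pthread_setcancel(), and a canceled pthread reports the status PTHREAD_CANCELED to pthread_join(). The names worker() and cancel_demo() are hypothetical.

```c
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg)
{
    int old, ignored;

    (void)arg;
    for (;;) {
        /* Disable general cancelability around work that must finish. */
        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
        /* ... for example, code that holds a mutex ... */
        pthread_setcancelstate(old, &ignored);

        /* A pending request is acted on here and nowhere else. */
        pthread_testcancel();
        sched_yield();          /* not a cancellation point */
    }
    return NULL;                /* not reached */
}

/* Returns 1 if the worker was canceled, 0 otherwise. */
int cancel_demo(void)
{
    pthread_t tid;
    void *status = NULL;

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return 0;
    if (pthread_cancel(tid) != 0)
        return 0;
    if (pthread_join(tid, &status) != 0)
        return 0;
    return status == PTHREAD_CANCELED;
}
```

Asynchronous cancelability stays disabled (the POSIX default as well), so the worker can terminate only at its explicit cancellation point.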

Pthreads Cleanup Routines 

Synopsis                                Description

void pthread_cleanup_pop(               Removes a routine from the top of the cleanup
    int execute );                      stack of the calling pthread and optionally
                                        executes it.

void pthread_cleanup_push(              Pushes a routine onto the cleanup stack of the
    void (*routine)(void *arg),         calling pthread.
    void *arg );


Pthreads may have resources that must be released before the pthread terminates. Each pthread can 
create a list of cleanup routines, called the cleanup stack, to release those resources. The routines on 
the cleanup stack are called, in order from top to bottom, when the pthread terminates for any of the 
following reasons: 

• Calling pthread_exit(). 

• Returning from its initial function. 

• Being cancelled by another pthread. 

To place a function on the cleanup stack, call pthread_cleanup_push(); to remove the top function 
from the cleanup stack, call pthread_cleanup_pop(). You can optionally execute the function as it 
is popped. Every call to pthread_cleanup_push() must be matched with a pthread_cleanup_pop() 
call in the same lexical scope (that is, within the same set of “{ ... }” braces). 

If general cancelability is enabled, whenever a pthread allocates a resource it should push a function 
that deallocates that resource onto the cleanup stack; when the pthread is finished with the resource 
it should deallocate it by popping the function off the cleanup stack and executing it. This ensures 
that all resources are accounted for if the pthread is cancelled. 
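
Here’s a minimal sketch of the push/pop pairing, written against the POSIX versions of these two calls (assumed here to behave as described above). The names release_buffer(), worker(), cleanup_demo(), and cleanup_calls are invented for the illustration.

```c
#include <pthread.h>
#include <stdlib.h>

static int cleanup_calls = 0;   /* counts handler runs, for illustration */

/* Cleanup routine: releases the resource passed as its argument. */
static void release_buffer(void *arg)
{
    free(arg);
    cleanup_calls++;
}

static void *worker(void *arg)
{
    char *buf = malloc(64);

    (void)arg;
    if (buf == NULL)
        return NULL;

    /* Push right after allocating, so a cancellation cannot leak buf. */
    pthread_cleanup_push(release_buffer, buf);
    buf[0] = '\0';              /* ... use the resource ... */
    /* Same lexical scope as the push; 1 means "pop and execute". */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns the number of times the cleanup routine ran, or -1 on error. */
int cleanup_demo(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return cleanup_calls;
}
```

If the worker were canceled between the push and the pop, release_buffer() would run as part of termination instead.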

Managing Pthread Keys 

Synopsis                                Description

int pthread_keycreate(                  Creates a key to be used with pthread-specific
    pthread_key_t *key,                 data.
    void (*destructor)(void *value) );

int pthread_setspecific(                Binds a pthread-specific value to a key.
    pthread_key_t key,
    void *value );

int pthread_getspecific(                Returns the value bound to a key.
    pthread_key_t key,
    void **value );


The pthreads package provides pthread-specific data objects to associate information with 
individual pthreads. Each pthread-specific data object is controlled by a key (an object of type 
pthread_key_t). A pthread creates a new key by calling pthread_keycreate(), associates the key 
with a pthread-specific data object by calling pthread_setspecific(), and then retrieves the data 
associated with the key by calling pthread_getspecific(). See the OSF/1 Programmer’s Reference 
for more information on these calls. 
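
The create, bind, and retrieve sequence might look like the following sketch. It assumes the POSIX forms of the calls, which differ from the synopsis above: the key is created with pthread_key_create(), and pthread_getspecific() returns the value directly instead of storing it through a pointer. The names make_key(), worker(), and tsd_demo() are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t  key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() is the destructor; it runs on each pthread's value at exit. */
    pthread_key_create(&key, free);
}

static void *worker(void *arg)
{
    int *slot = malloc(sizeof *slot);
    int *mine;

    if (slot == NULL)
        return NULL;
    *slot = *(int *)arg;

    pthread_once(&key_once, make_key);      /* create the key only once */
    if (pthread_setspecific(key, slot) != 0) {
        free(slot);
        return NULL;
    }

    /* Later, possibly in another function: fetch this pthread's value. */
    mine = pthread_getspecific(key);
    return (void *)(intptr_t)(*mine * 10);
}

/* Returns the sum of the two per-pthread results, or -1 on error. */
int tsd_demo(void)
{
    pthread_t t1, t2;
    int a = 1, b = 2;
    void *r1 = NULL, *r2 = NULL;

    if (pthread_create(&t1, NULL, worker, &a) != 0)
        return -1;
    if (pthread_create(&t2, NULL, worker, &b) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    return (int)(intptr_t)r1 + (int)(intptr_t)r2;
}
```

Both pthreads use the same key, yet each one reads back only the value it bound itself.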

Executing a Routine Once 

Synopsis                                Description

int pthread_once(                       Calls an initialization routine.
    pthread_once_t *once_block,
    void (*routine)() );


The pthread_once() call executes the specified routine the first time it is called (from any pthread), 
and does nothing every subsequent time. The parameter once_block must be declared as static. For 
example: 

static pthread_once_t init_once;

void lib_util_init()
{
    /* perform some initialization that can only be done once */
}

/*
 * Every pthread calls pthread_once(), but only the first one
 * executes lib_util_init().
 */
if (pthread_once(&init_once, lib_util_init) == -1) {
    perror("pthread_once Error");
}


Managing Signals 


Synopsis                                Description

int sigwait(                            Suspends the calling pthread until one of a
    sigset_t *set );                    specified set of signals is received.


The sigwait() call is used to turn asynchronous signals into synchronous notifications. Before calling 
sigwait(), you must create a signal set, using the standard signal calls sigemptyset(), sigfillset(), 
sigaddset(), and sigdelset(), and then block the signals in that set from being delivered. When you 
call sigwait() with that signal set, the calling pthread is suspended until one or more of the signals 
in the set is received by the process containing the pthread. If one of the specified signals was 
received (and blocked) before the call to sigwait(), the call returns immediately. sigwait() returns 
the signal number of the signal that was received. 

The sigwait() call only works for asynchronous signals (those that are generated externally from the 
pthread, such as those generated by kill() in other processes or by the user pressing <Ctrl-\>). 
Contrast this with sigaction(), which only works for synchronous signals (those that are generated 
as the result of the pthread's faults, such as SIGBUS). If both sigaction() and sigwait() are used on 
the same signal, the results are unspecified. 


NOTE 

In a parallel application, sending an asynchronous signal to an 
application’s controlling process affects the controlling process (as 
specified by the controlling process’s signal mask), and also 
causes the signal to be broadcast to the compute processes. In an 
application linked with -nx, the controlling process’s signal mask 
is always the default. 


See “Signals and Pthreads Library Calls” on page 6-39 for more information on signals in 
applications with multiple pthreads. 

Here’s an example that uses sigwait() to deal with the asynchronous signal SIGQUIT. This example 
uses a parent process to generate the asynchronous signal by calling kill(). 

long     sig;
long     ret;
pid_t    pid;
sigset_t set;

sig = SIGQUIT;
pid = fork();

if (pid == -1) {
    perror("fork()");
    exit(1);
} else if (pid != 0) {
    /* parent process */
    sleep(2);

    /*
     * Deliver the signal SIGQUIT to child process.
     */
    if (kill(pid, sig) == -1) {
        perror("kill");
        exit(1);
    }
    exit(0);
}

/* child process */
if (sigemptyset(&set) != 0) {
    perror("sigemptyset");
}

/* Add the signal SIGQUIT to the signal mask */
if (sigaddset(&set, sig) != 0) {
    perror("sigaddset");
}

/* Block the signal SIGQUIT from delivery */
if (sigprocmask(SIG_BLOCK, &set, NULL) != 0) {
    perror("sigprocmask()");
}

/*
 * During the next 10 seconds, the posted signal from the
 * parent process becomes a pending signal.
 */
sleep(10);

/*
 * sigwait() blocks the calling thread until the specified signal
 * arrives, then unblocks and returns the value SIGQUIT.
 */
if ((ret = sigwait(&set)) == -1) {
    perror("sigwait()");
} else {
    printf("Received signal %ld, expected %ld\n", ret, sig);
}

/*
 * The thread can decide what to do with this signal.
 */

/*
 * There is no destructive default action of core dump on the
 * posted signal SIGQUIT.
 */

Interfacing with Non-Thread-Safe Code 

Whenever you call a non-thread-safe library from a process with multiple pthreads, you must make 
sure that no two pthreads call the same library at the same time. There are two ways to do this: 

• Make sure that only one pthread ever calls the library. 

• Use mutexes to protect all calls to the library. 

Here’s an example of the second technique: 

pthread_mutex_lock(&mutex);
non_thread_safe_call();
pthread_mutex_unlock(&mutex);

Note that the same mutex must be used by all pthreads for any calls from the same library. If all calls 
to the non-thread-safe library are surrounded by a lock and unlock of the same mutex, as shown here, 
any pthread that calls the library while another pthread is currently calling it will block until the other 
pthread returns and unlocks the mutex. See “Managing Mutexes” on page 6-16 for more information 
on mutexes. 
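
A fuller sketch of the same idea wraps the library call in a single function, so that every pthread is forced through the same mutex. It assumes the POSIX mutex calls; non_thread_safe_call() is a hypothetical stand-in that updates static state without locking, and safe_call(), worker(), and run_workers() are likewise invented names.

```c
#include <pthread.h>

#define NTHREADS 4
#define NCALLS   1000

/* Hypothetical stand-in for a non-thread-safe library call:
   it updates static state without any locking of its own. */
static long calls_made = 0;

static void non_thread_safe_call(void)
{
    calls_made = calls_made + 1;
}

/* One mutex shared by every caller of the library. */
static pthread_mutex_t lib_mutex = PTHREAD_MUTEX_INITIALIZER;

static void safe_call(void)
{
    pthread_mutex_lock(&lib_mutex);
    non_thread_safe_call();
    pthread_mutex_unlock(&lib_mutex);
}

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < NCALLS; i++)
        safe_call();
    return NULL;
}

/* Returns the number of calls recorded by the library stand-in. */
long run_workers(void)
{
    pthread_t tid[NTHREADS];
    int i, started = 0;

    for (i = 0; i < NTHREADS; i++) {
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return calls_made;
}
```

Routing every call through one wrapper makes it harder for a pthread to reach the library with a different mutex, or with none.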


Message Passing and Pthreads Library Calls 

Paragon OSF/1 message-passing is done on a process-by-process basis. All message-sending calls 
specify the recipient by node and process type; there is no way to specify a particular pthread within 
that process (all the threads in a process have the same process type). Similarly, when a message 
arrives at a process, there is nothing to prevent confusion among pthreads; for example, a pthread 
could probe for a message, find a pending message of the specified type, and then attempt to receive 
it—only to find that another pthread has already received it. For this reason, you should make sure 
that only one pthread in each process uses message-passing calls. 

You should also keep the following special considerations in mind when using message-passing and 
pthreads in the same application: 

• Blocking calls, such as csend(), only block the calling pthread, not the entire process. While the 
calling pthread is blocked, other pthreads can continue to run. The pthread that is blocked 
releases processor resources until the csend() returns. 

• When the message-passing pthread uses one of the global calls (those described under “Global 
Operations” on page 3-27), the call blocks until the message-passing pthread on every other 
node makes the same call. If one of those message-passing pthreads is blocked (for example, by 
a mutex lock), the operation will hang all the message-passing pthreads in the application. 

• An hsend()/hrecv() handler cannot use calls from libpthreads.a or libc_r.a (note that libc_r.a 
includes almost the entire C library). Using these calls within a handler can give unexpected 
results or cause the handler to hang. If you need to use any of these calls, you can have the 
handler use csend() to send a message to the main message-passing pthread to carry out the 
desired operation. 

• Because an hsend()/hrecv() handler cannot use calls from libc_r.a, if a call to libnx.a within a 
handler fails, it could also hang the handler. This occurs because the failed library call will call 
printf() to print out the error message. To avoid this problem, use only underscore calls within 
hsend()/hrecv() handlers. 

• An hsend()/hrecv() handler also should not use the info...() calls. Because the handler executes 
concurrently with the main message-passing pthread, the info...() calls may return values 
representing messages received by the main message-passing pthread. The main 
message-passing pthread can use masktrap() to protect critical regions from the handler. 

• If an hsend()/hrecv() handler performs any message passing, you must put masktrap() calls 
around any message-passing calls in the main message-passing pthread that could be called 
while the handler is active. Otherwise, any info...() calls in the handler could reflect the value 
of a message received by the main message-passing pthread. 

In addition, any info...() call in the main program must be within the same set of masktrap() 
calls as the message-receiving call to which it applies. Otherwise, the info...() call in the main 
message-passing pthread could reflect the value of a message received by the handler. 


File I/O and Pthreads Library Calls 

In general, opened files are per-process resources. A pthread can open a file, a second pthread can 
use the open file descriptor to write or read, and a third pthread can use the same file descriptor in 
an lseek() call. The movement of file pointers is visible to all pthreads, so if multiple pthreads are 
accessing the same file they must coordinate their actions with mutexes or condition variables. 
However, blocking calls such as read() and cwrite() only block the calling pthread, not the entire 
process. 

If two pthreads doing file I/O read and write concurrently, they can each read and write their own 
data independently. If you are performing I/O to a file in a synchronized PFS I/O mode (see “Using 
I/O Modes” on page 5-13), the synchronization information is stored with the file descriptor; each 
file is synchronized independently. 

See “Recommended Safe Operating Environment” on page 6-4 for limitations on using I/O calls 
from multiple pthreads. 

nx_nfork() and nx_initve() and Pthreads Library Calls 

In a controlling process with multiple pthreads: 

nx_nfork()      Copies only the calling pthread to the new process on each node. Can only be 
                called from one pthread. 

nx_initve()     If the user’s shell is the Bourne shell (sh), nx_initve() performs a fork() 
                internally. As described under “Using Reentrant C Library Calls” on page 
                6-6, fork() copies only the calling pthread to the new process. This means that 
                if there are multiple pthreads in the calling process before the call to 
                nx_initve(), after the nx_initve() all pthreads except the calling pthread will 
                appear to cease to exist. (If the user’s shell is ksh or csh, this problem does 
                not exist.) 

                nx_initve() can be called at most once in a process. This means that at most 
                one pthread in a process can call it. 


Signals and Pthreads Library Calls 

The following special considerations apply to signals in programs with multiple pthreads. 


Signal Types 

There are two types of signals: synchronous and asynchronous: 

• Synchronous signals are caused by a pthread’s own actions, such as when a pthread divides by 
zero (causing a SIGFPE signal) or attempts to access memory outside its address space 
(causing a SIGSEGV signal). 

• Asynchronous signals are caused by something external to the pthread, such as another process 
calling kill() or the user pressing <Ctrl-\> on the keyboard (causing a SIGQUIT signal). 

If a pthread causes a synchronous signal, the handler routine executes in the context of that pthread 
only. If an asynchronous signal is delivered to a process, the handler routine executes in the context 
of the main thread. 

Signals are a Per-Process Resource 

Signals are generally managed as per-process objects in pthreads programs. Signal masks, signal 
handlers, and signal sending and receiving are all oriented toward the process, not toward a 
particular pthread. This means that signals affect the entire process. Of particular interest: 

• A SIGSTOP signal stops all pthreads of the receiving process. 

• A SIGCONT signal continues all pthreads of the receiving process. 

• If one pthread of a program with multiple pthreads causes a SIGSEGV or SIGBUS, the entire 
process (not just the faulting pthread) receives the signal. If this signal has not been handled, all 
pthreads are killed and the program core dumps. 

Which pthread is interrupted to execute a registered handler for a signal may be specific to the type 
of signal, but in general the entire process receives the signal. 

It is important to be aware that a pthread program's signal mask has a per-process visibility. In other 
words, all pthreads share the same mask. If one pthread changes its mask (for example, by calling 
sigprocmask()) the change affects all pthreads. The thread-safe sigwait() requires manipulation of 
the signal mask, as do sigaction() and other common signal-management routines. 

Along with the signal mask, signal handlers are also process-wide objects. A signal handler can be 
registered for the process (for example, by calling sigaction() or sigwait()) by any pthread. Because 
the handlers are process-wide objects, a second pthread registering a handler for a given signal will 
override the handler registered by the first pthread. 

In general, blocking calls only block the calling pthread. This is the case with sigsuspend() as it is 
with wait(), sleep(), etc. Note also that if multiple pthreads are blocking on sigsuspend() for a given 
signal, all pthreads will continue when that signal arrives. This differs from sigwait(), which 
unblocks only one of the pthreads. 

sigwait() creates a hidden pthread which manipulates the process’s signal mask and registers a signal 
handler for each signal sigwait() has been asked to wait for. Because of this, use of other signal 
management calls (especially sigaction()) on the signals being waited for would be hazardous. Care 
must be taken when changing a signal mask so the state of a sigwait()ed signal's bit is not changed. 

Dealing with Signals 

One way to deal with signals in a pthread application is the following: 

• Use sigaction() to catch the synchronous signals. sigaction() only works with synchronous 
signals. 

• Use sigprocmask() to block the asynchronous signals, then sigwait() to receive the signal as a 
notification. sigwait() only works with asynchronous signals. 

Do not use sigwait() and sigaction() on the same signal. 
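
The block-then-wait half of this recipe can be sketched with the standardized POSIX calls. These are an assumption and differ from the calls documented here: POSIX sigwait() takes a second argument that receives the signal number and returns 0 on success, and pthread_sigmask() is the POSIX call for changing the mask in a program with multiple pthreads. The name sigwait_demo() is hypothetical, and raise() merely stands in for an external sender such as kill() in another process.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

/* Returns the signal number received by sigwait(), or -1 on error. */
int sigwait_demo(void)
{
    sigset_t set;
    int sig = 0;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* Block first, so the signal stays pending instead of being delivered. */
    if (pthread_sigmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    /* Stand-in for an asynchronous sender; the signal becomes pending. */
    if (raise(SIGUSR1) != 0)
        return -1;

    /* The signal is already pending, so sigwait() returns at once. */
    if (sigwait(&set, &sig) != 0)
        return -1;
    return sig;
}
```

No handler is installed for SIGUSR1 anywhere in the sketch, which keeps sigwait() and sigaction() from being used on the same signal.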


Handling Errors 

The handling of error situations in a program with multiple pthreads should be robust and graceful. 
It should protect the pthreads that did not cause the error from being interrupted or terminated. It also 
should give information on which pthread caused the error and coordinate a proper shutdown of all 
pthreads if the error is fatal. If the error cannot be recovered from by the pthread causing the error, 
and other pthreads are depending on this pthread to progress, it might be best to terminate those 
pthreads right away and shut down the rest of the pthreads later. 


errno Confusion 

In a single-threaded program, the global variable errno is set to an error value when a system or 
library call fails. An immediate call to perror() or nx_perror() will read the value of errno and print 
out the error message corresponding to its current value. User-written code may also set errno to take 
advantage of this standard error-handling mechanism. 

In a program with multiple pthreads, the global errno variable is replaced by a per-pthread errno. 
When a failing system or library call sets the per-pthread errno, the other pthreads’ errnos are not 
affected. The selection of per-pthread or per-process errno is determined at compile time by the 
preprocessor symbol _REENTRANT: 

• If _REENTRANT is defined at the point the file <errno.h> is included, the symbol errno 
refers to the per-pthread errno. 

• Otherwise, the symbol errno refers to the global (per-process) errno. 

The symbol _REENTRANT is defined in <pthread.h>. However, because some programs include 
other header files before <pthread.h>, you should always use the switch -D_REENTRANT on the 
command line when compiling a program that uses multiple pthreads. This symbol ensures that the 
correct versions of call prototypes and preprocessor symbols are pulled in from header files. 
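
The idea of a per-pthread errno can be illustrated with a small sketch. It assumes a current POSIX system, where errno is per-thread without any special compile-time switch, so it shows the concept rather than the behavior of the libraries described in this chapter. The names worker() and errno_is_per_thread() are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Records where this pthread's errno lives. */
static void *worker(void *arg)
{
    uintptr_t *out = arg;

    *out = (uintptr_t)&errno;
    return NULL;
}

/*
 * Returns 1 if the calling pthread and a second pthread have
 * distinct errno objects, 0 if they share one, -1 on error.
 */
int errno_is_per_thread(void)
{
    pthread_t tid;
    uintptr_t theirs = 0;

    if (pthread_create(&tid, NULL, worker, &theirs) != 0)
        return -1;
    pthread_join(tid, NULL);
    return theirs != (uintptr_t)&errno;
}
```

The sketch compares addresses rather than values because even a successful library call is allowed to change the value of errno.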
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In the current release, the libraries provided with Paragon OSF/1 are not consistent in their use of 
the two different ermos: 

• libpthreads.a sets and references only the per-pthread ermo. 

• libc_r.a sets both the global and the per-pthread ermo, but only references the per-pthread 
ermo. 


• All other libraries set and reference only the global ermo. 

This means that pthreads programs and most libraries do not agree on which errno they use. For
example, libnx.a uses the global errno. If a call to libnx.a fails in a program compiled
with _REENTRANT, the errno value it sets is not usable by the calling pthread. The library call
sets the global variable errno, but the calling pthread can see only its own pthread-specific errno
(whose value does not reflect the error in the libnx.a call that just failed).

In general, this means that code compiled with _REENTRANT cannot use errno values returned
by non-thread-safe libraries. However, because some non-thread-safe libraries make calls to the
standard C library, some errno values are usable. For example, cread() (in the non-thread-safe
library libnx.a) calls read(). If you link with libc_r.a, any error that occurs in read() will be reflected
in the per-pthread errno and will be visible to the calling pthread. But any error that occurs in the
cread() call itself (before or after the call to read()) will not be visible to the calling pthread.
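
To make the idea of a per-pthread errno concrete, the following C sketch has a second pthread record
the address of its own errno so that the first pthread can compare it with its own. This is only an
illustration: it assumes a present-day POSIX threads implementation rather than Paragon OSF/1, and
the function names are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Runs in the second pthread: store the address of this pthread's errno. */
static void *record_errno_address(void *arg)
{
    int **slot = arg;

    *slot = &errno;
    return NULL;
}

/* Returns 1 if the two pthreads have separate errno objects, 0 if they
   share one, and -1 if the second pthread could not be run. */
int errno_is_per_thread(void)
{
    pthread_t tid;
    int *worker_errno = NULL;

    if (pthread_create(&tid, NULL, record_errno_address, &worker_errno) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    assert(worker_errno != NULL);
    return worker_errno != &errno;
}
```

When errno is per-pthread, a library that stores its error in a different (global) errno object leaves
the caller's errno untouched, which is the situation described above.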


perror() and nx_perror()

The perror() and nx_perror() calls print an error message based on the following versions of errno:

• The perror() call in libc.a uses the global errno.

• The perror() call in libc_r.a uses the per-pthread errno.

• The nx_perror() call in libnx.a uses the global errno.

Because nx_perror() is not a thread-safe call, you should be sure to protect it with a mutex so that
no two pthreads can call it at once.
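
One way to provide that protection is to route every call through a small wrapper that holds a mutex.
The C sketch below shows the pattern under stated assumptions: unsafe_report() is a hypothetical
stand-in for a non-thread-safe call such as nx_perror(), and all other names are invented for this
example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_REPORTERS 16

static pthread_mutex_t report_lock = PTHREAD_MUTEX_INITIALIZER;
static char last_message[128];   /* shared buffer, as a non-thread-safe call might use */
static int reports_made;         /* shared counter updated without atomics */

/* Hypothetical stand-in for a non-thread-safe call such as nx_perror(). */
static void unsafe_report(const char *who, int err)
{
    snprintf(last_message, sizeof last_message, "%s: error %d", who, err);
    reports_made++;
}

/* Every pthread goes through this wrapper, so calls never overlap. */
void locked_report(const char *who, int err)
{
    pthread_mutex_lock(&report_lock);
    unsafe_report(who, err);
    pthread_mutex_unlock(&report_lock);
}

static void *reporter(void *arg)
{
    int count = *(int *)arg;

    for (int i = 0; i < count; i++)
        locked_report("worker", i);
    return NULL;
}

/* Starts nthreads reporters and returns the total number of reports made. */
int run_reporters(int nthreads, int per_thread)
{
    pthread_t tid[MAX_REPORTERS];
    int started = 0;

    assert(nthreads >= 0 && nthreads <= MAX_REPORTERS);
    reports_made = 0;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[started], NULL, reporter, &per_thread) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return reports_made;
}
```

Because the counter is only ever updated while the mutex is held, no updates are lost even when
several pthreads report at once.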


Calling exit()

Calling exit() when an error occurs terminates the entire process and closes any opened files. For
this reason, it’s a bad idea to call exit() on an error returned from a system call inside a pthread.
Instead, you should call pthread_exit() to terminate the failing pthread and return a value indicating
failure to the pthread that calls pthread_join(). The pthread that calls pthread_join() should use this
information to shut down all other pthreads properly.
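
The following C sketch illustrates this pattern with standard POSIX calls. The status values
WORKER_OK and WORKER_FAILED and the function names are assumptions made for this example; a real
program would replace the should_fail flag with the result of an actual system call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_WORKERS   16
#define WORKER_OK      0
#define WORKER_FAILED  1

/* A worker that fails ends only itself and passes a status to the joiner. */
static void *worker(void *arg)
{
    int should_fail = *(int *)arg;

    if (should_fail)
        pthread_exit((void *)(intptr_t)WORKER_FAILED);   /* not exit() */
    return (void *)(intptr_t)WORKER_OK;
}

/* Starts one worker per flag, joins them all, and returns how many
   reported failure. */
int count_failed_workers(int *fail_flags, int n)
{
    pthread_t tid[MAX_WORKERS];
    int failed = 0;

    assert(n >= 0 && n <= MAX_WORKERS);
    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &fail_flags[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n; i++) {
        void *status = NULL;

        pthread_join(tid[i], &status);
        if ((intptr_t)status == WORKER_FAILED)
            failed++;
    }
    return failed;
}
```

The joining pthread sees every failure and the process keeps running, so it can decide how to shut
down the remaining pthreads.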


Use of Underscore Versions of Paragon System Calls

The standard versions of most Paragon system calls in libnx.a terminate the calling process when an
error occurs and send a message to standard error describing the error. For example, isend() will call
nx_perror() to print out the error message, then call exit() to terminate the process. This implies that
if a pthread causes an error in an isend() call, this error will kill the rest of the pthreads in the process.
If this error occurs inside an hsend() or hrecv() handler routine, it will hang the handler routine.

For this reason, in programs with multiple pthreads you should always use the underscore versions
of these calls instead. For example, calling _isend() will return -1 and set the global errno in case of
error, instead of terminating the process. The calling pthread can then use this information to shut
down the rest of the pthreads cleanly. See “Handling Errors” on page 4-42 for more information on
underscore calls.


Catch Signals Causing Core Dump by Default 

The default action for the signals SIGFPE (floating point exception) and SIGSEGV (segmentation 
violation) is to core dump, terminate the process, and terminate the application. This will also kill 
all pthreads in the application. 

When this occurs, you want to be able to figure out which pthread was responsible for this problem.
For synchronous signals, the best way to do this is to install a signal handler to catch them and print
out the pthread ID when the signal is received. For asynchronous signals, use sigwait() to catch them
and then terminate all pthreads gracefully.
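
The sigwait() approach can be sketched as follows. This is a minimal illustration using standard
POSIX calls on a present-day system, not Paragon-specific code. The structure and function names are
invented for the example, and pthread_kill() merely stands in for a signal that would normally arrive
from outside the program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>

struct waiter_state {
    sigset_t set;     /* signals the waiter collects */
    int caught;       /* signal number returned by sigwait(), or -1 */
};

/* Dedicated pthread: wait synchronously for one signal from the set. */
static void *signal_waiter(void *arg)
{
    struct waiter_state *st = arg;

    if (sigwait(&st->set, &st->caught) != 0)
        st->caught = -1;
    return NULL;
}

/* Blocks signo, lets a dedicated pthread collect it with sigwait(),
   and returns the signal number that pthread saw (-1 on failure). */
int catch_with_sigwait(int signo)
{
    struct waiter_state st;
    sigset_t old;
    pthread_t tid;

    assert(signo > 0);
    st.caught = 0;
    sigemptyset(&st.set);
    sigaddset(&st.set, signo);

    /* Block the signal before creating the waiter so that the new pthread
       inherits the mask and the default action is never taken. */
    pthread_sigmask(SIG_BLOCK, &st.set, &old);
    if (pthread_create(&tid, NULL, signal_waiter, &st) != 0) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    pthread_kill(tid, signo);   /* stand-in for an asynchronous signal */
    pthread_join(tid, NULL);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return st.caught;
}
```

In a real program the waiter would go on to tell the other pthreads to finish, rather than simply
returning the signal number.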


When One Pthread Hangs 

When debugging a program with multiple pthreads, always keep track of every active pthread, as 
much as possible, to detect the hang of a single pthread. Knowing which pthread has hung will help 
you determine the cause of a program hang. 
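
One simple way to keep track of active pthreads is a heartbeat array: each pthread increments its own
counter as it makes progress, and a monitor compares two snapshots taken some time apart. The C
sketch below shows only the comparison step. The heartbeat scheme and the name find_stalled() are
assumptions made for this example, not a facility provided by the system.

```c
#include <assert.h>
#include <stddef.h>

/* before[] and after[] are two snapshots of the per-pthread heartbeat
   counters. Returns the index of the first pthread whose counter did not
   advance between the snapshots, or -1 if every counter advanced. */
int find_stalled(const unsigned long *before, const unsigned long *after,
                 int nthreads)
{
    assert(before != NULL && after != NULL && nthreads >= 0);
    for (int i = 0; i < nthreads; i++)
        if (after[i] == before[i])
            return i;
    return -1;
}
```

A counter that stops advancing does not prove that a pthread has hung (it may simply be blocked on a
long operation), but it tells you where to look first.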


Introduction

This chapter describes some general design guidelines to follow when writing parallel applications.
However, the best way to become skilled in parallel programming is to do it. With that in mind, this
chapter presents three examples of parallel applications. Each example is intended to illustrate a
different aspect of parallel design technique.

• The first example is a nearly-perfectly-parallel application that evaluates a definite integral to
calculate pi. This example illustrates how a sequential application can be ported to a parallel
system with minimal effort. Much of the sequential algorithm can be maintained. The parallel
design consists of separating the user interface from the core computation and then distributing
that core computation onto the nodes.

• The next example is the multiplication of a matrix by a vector. In addition to the numerical 
technique, this example illustrates the use of parallel file I/O by assuming a matrix that is too 
large to reside entirely in memory. 

• The third example solves a classic computer science problem called the N-Queens problem.
Given a chess board with N x N grid locations, where can you place N queens so that no queen
is under attack? This example illustrates a technique called control decomposition. This
technique also appears in more complicated real-world applications such as electronic design
rule checking.


Designing a Parallel Application 


Paragon™ User’s Guide 


The Paragon™ OSF/1 Programming Model 

As described in Chapter 1, the Intel supercomputer is a distributed-memory parallel computer with 
a high-speed interconnect network. The following characteristics of the system should be kept in 
mind when designing or porting applications: 

• The system is made up of an ensemble of processor/memory pairs called nodes. The nodes do 
not share memory. They present a single system image (for example, a process running on one 
node can send a signal to a process running on another node), but the nodes operate 
independently of each other. 

• All the nodes are fully connected. They communicate with each other and the host by passing 
messages. 

• Each node executes its own program. In many applications, it turns out that each node executes 
the same program on a different set of input data. There may be some conditional code that 
identifies one or more nodes that perform special actions. 

These characteristics influence the design of parallel applications, as described in the remainder of 
this chapter. 


Parallel Programming Techniques 

Parallel applications have varying degrees of parallelism. A perfectly-parallel application is one that
requires no internode communication. In a perfectly-parallel application, if you double the number
of nodes, you halve the computation time.

Most applications involve a mix of computation and internode communication; in these applications,
increasing the number of nodes reduces the computation time, but can never yield a “perfect”
speedup. The more time a program spends communicating instead of computing, the less speedup
you get by adding nodes.
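
The effect of communication on speedup can be seen with a very simple model. The C sketch below
assumes that the computation divides evenly among the nodes while the communication time per node
stays fixed. Both the model and the name model_speedup() are simplifying assumptions made for this
illustration, not measured characteristics of the system.

```c
#include <assert.h>

/* compute_time: time for the whole computation on one node.
   comm_time:    time each node spends communicating (or waiting), assumed
                 not to shrink as nodes are added.
   Returns the one-node time divided by the modeled time on `nodes` nodes. */
double model_speedup(double compute_time, double comm_time, int nodes)
{
    assert(nodes > 0 && compute_time > 0.0 && comm_time >= 0.0);
    return compute_time / (compute_time / nodes + comm_time);
}
```

With 100 seconds of computation and no communication, four nodes give a speedup of 4. If each node
also spends 25 seconds communicating, four nodes give a speedup of only 2, and under this model no
number of nodes can push the speedup to 4.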

In order to get the best possible speed from a parallel program, you must design it so that each node 
spends as much time as possible computing, and as little time as possible communicating (or waiting 
for communication). Here are some techniques that can help you to do this: 

• Separate the user interface from the computational parts of the code.

• Distribute the computation among the nodes so that their computational load is evenly balanced.

• Write your application so that you can run it on more nodes, thus improving performance,
without having to recode.

• Design your internode communication such that the nodes spend as little time in communication
(or waiting for communication) as possible.

The following sections tell you more about these techniques.


Separating the User Interface from the Computation

To have each node do as much computation, and as little non-computational work, as possible, you
should analyze the algorithm and separate the user interface from the computational kernel. You can
designate one of the nodes to handle the user interface, or put the user interface in the application’s
controlling process (see “The Controlling Process” on page 4-21 for information on this process).
In either case, the part of the program that handles the user interface and the part of the program that
does the computation communicate by passing messages.

In the pi example, node 0 requests the number of integration intervals from the user. It then sends that
number to the other nodes, and all the nodes do the calculation.


Balancing the Load 

You should keep all the nodes busy and have them finish at the same time, because if some nodes 
have to wait for others to finish, they’re wasting cycles doing nothing. Analyze your application and 
distribute the computation among the nodes so that their computational load is evenly balanced. 

The process of distributing a problem among the nodes is referred to as problem decomposition, or 
just decomposition. There are two kinds of decomposition: domain decomposition and control 
decomposition. 


Domain Decomposition 

In domain decomposition, the input data (the domain) is partitioned and assigned to different 
processors. How you divide and distribute the data among the nodes can have a significant effect on 
the efficiency of your application. 

For example, consider an application that performs image enhancement (see Figure 7-1). Because
some parts of the image may be more detailed than others, they will require more processing. The
shaded portion of Figure 7-1 shows the work done by node 0. If you divide the image sequentially
among the nodes, as shown in the top half of Figure 7-1, some nodes may get a partition that requires
a lot of work and other nodes may get a partition that requires little or no work. In the top half of
Figure 7-1, node 0 gets a lot of work and node 7 gets no work at all. This is inefficient.

You can often achieve better load balancing by dividing the image into smaller partitions and then 
distributing the partitions sequentially among the nodes, as shown in the bottom half of Figure 7-1. 
This is analogous to dealing out the partitions like cards in a deck; it spreads out the work more 
evenly among the nodes. As the bottom half of Figure 7-1 shows, each node gets some slices that 
require a lot of work, some slices that require a moderate amount of work, and some slices that 
require no work. This is more balanced and efficient for this type of problem, and may be appropriate 
for your problem as well. 
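
The difference between the two ways of assigning partitions can be sketched in C as follows. The
function names and the per-partition work values are invented for this illustration; the point is only
how the owner of partition k is computed in each scheme.

```c
#include <assert.h>

#define MAX_NODES 64

/* Sequential division: partitions are split into contiguous runs, one per node. */
int block_owner(int part, int nparts, int nodes)
{
    int per_node = (nparts + nodes - 1) / nodes;   /* ceiling division */

    return part / per_node;
}

/* Dealing: partition k goes to node k mod nodes, like cards from a deck. */
int dealt_owner(int part, int nodes)
{
    return part % nodes;
}

/* work[k] is the cost of partition k. Returns the largest total work that
   any one node receives; dealt selects the assignment scheme. */
int max_node_load(const int *work, int nparts, int nodes, int dealt)
{
    int load[MAX_NODES] = {0};
    int worst = 0;

    assert(nodes > 0 && nodes <= MAX_NODES && nparts >= 0);
    for (int k = 0; k < nparts; k++) {
        int owner = dealt ? dealt_owner(k, nodes)
                          : block_owner(k, nparts, nodes);
        load[owner] += work[k];
    }
    for (int i = 0; i < nodes; i++)
        if (load[i] > worst)
            worst = load[i];
    return worst;
}
```

For eight partitions whose costs are 9, 9, 9, 9, 0, 0, 0, 0 and two nodes, sequential division gives one
node 36 units of work and the other none, while dealing gives each node 18.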

[Figure 7-1 shows the image divided among nodes 0 through 7 in two ways.]

Poor load balancing: Nodes 0 through 3 get most of the work.
Nodes 4 through 7 have little or nothing to do.

Good load balancing: The partitions in the domain are dealt out to
the nodes like cards from a deck. Now, each node has
approximately the same amount of work.

Figure 7-1. Using Domain Decomposition to Achieve Load Balancing


Control Decomposition

Control decomposition, on the other hand, divides the tasks to be performed rather than the data. For 
many problems, this is a more natural decomposition. 

For example, consider a tree-search used in a game-playing algorithm. Assume that you’re at some 
mid-level of the tree. You could approach the problem as a domain decomposition and divide the 
current branches among the nodes. Each node would then follow its branch down to the leaves and 
then return the leaves as an answer. The leaves in this case are the possible moves. Depending on 
the current state of the game, some of the branches may be quite involved and require a great deal 
of processing. Other branches may be simple. The result is that some nodes finish before others. This 
is a poor problem decomposition. 

Approaching this problem as a control decomposition achieves better load balancing. In a control 
decomposition, you think of the branches not as data partitions but rather as tasks that need to be 
performed. 

To manage these tasks, you have to introduce a little bureaucracy. Assign one node as a manager 
node. This manager node then gives tasks to idle nodes. When the node finishes a task, it reports its 
answer and requests another task. It’s this “reporting for duty” that characterizes a control 
decomposition. 

The manager node must, of course, do some initial setup. For example, it may follow the tree down 
until the number of branches exceeds the number of available nodes by some predetermined factor. 

This method produces the best results when the tasks assigned near the end of the problem are about 
the same size. For example, if one of the last tasks assigned was a very long task, the other nodes 
may be idle while that last node finishes. 

The N-Queens example (presented later in this chapter) shows control decomposition. 
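
The effect of task order on a manager/worker scheme can be explored with a small simulation. The C
sketch below hands out tasks, in order, to whichever worker becomes idle first and reports when the
last worker finishes. The function name and the task lengths are invented for this illustration, and
message-passing overhead is ignored.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* task_len[t] is the running time of task t. Tasks are assigned in order
   to the worker that becomes idle first (lowest-numbered worker on a tie).
   Returns the time at which the last worker finishes. */
long finish_time(const long *task_len, int ntasks, int nworkers)
{
    long free_at[MAX_WORKERS] = {0};
    long last = 0;

    assert(nworkers > 0 && nworkers <= MAX_WORKERS && ntasks >= 0);
    for (int t = 0; t < ntasks; t++) {
        int idle = 0;

        for (int w = 1; w < nworkers; w++)
            if (free_at[w] < free_at[idle])
                idle = w;
        free_at[idle] += task_len[t];
    }
    for (int w = 0; w < nworkers; w++)
        if (free_at[w] > last)
            last = free_at[w];
    return last;
}
```

With two workers, the tasks 1, 1, 1, 1, 8 finish at time 10 because one worker sits idle while the
other runs the long final task. The same tasks in the order 8, 1, 1, 1, 1 finish at time 8.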


Making the Program Independent of the Number of Nodes 

You should write your application so that you can run it on more nodes, thus improving 
performance, without having to recode. 

This method also turns out to be the most natural one to use when porting an existing sequential 
application. After you’ve separated the user interface from the core computation, you still have a 
sequential algorithm, but you can think of it as the special case of an application that runs on one 
node. Once you have done this, you can parallelize the computation part for an arbitrary number of
nodes.

The pi example illustrates this technique. The number of nodes appears only as the variable nodes.


Designing Your Communication Strategy

You should design your internode communication such that the nodes spend as little time in
communication as possible. This may involve running some tests to determine an optimal message
length. Often, you can decrease the number of messages by increasing the size of each message. You
may also be able to improve communication performance by using asynchronous message-passing
calls, as described under “Asynchronous Send and Receive” on page 3-10.


Using Global Operations 

You should use the global operations, described under “Global Operations” on page 3-27, when
possible. That section described a simple example of a global sum. Using gdsum() results in a
significant improvement over having one node perform the global sum by explicitly collecting all
the partial sums. Also, after the execution of the gdsum(), the global sum is available on each node.

The matrix*vector example in this chapter uses another global operation called gcolx(). In that
example, a large vector is distributed over the nodes. gcolx() collects the components from each
node and constructs the complete vector on each node. As with gdsum(), the answer is available on
each node.


Using Alternate Node Topologies 

The nodes in the Intel supercomputer are connected in either a hypercube or a mesh network. 
However, because of the specialized message-passing hardware in both architectures, 
communication with distant nodes is nearly as fast as communication with neighboring nodes. This 
means that you do not have to structure your application’s communications as a hypercube or mesh; 
you can choose an alternate topology that makes more sense for your program. This can make your 
program easier to write and understand, at a tiny cost in performance. 

When you use an alternate node topology, you embed your node topology (a virtual topology) into
the nodes’ actual network topology (the physical topology). One example of a virtual topology is the
ring. This topology is useful in certain types of many-body calculations. The technique consists of
partitioning the particles into groups and assigning each group to a different node. A node then
calculates the state of its group. This state information is then passed to another node, which
calculates the state of its own particles and takes into account the state received from the previous
node. The state information moves from node to node around a ring. You can implement a ring
topology by writing a function like this one:

    int succ(int n)
    {
        int maxnode;

        maxnode = numnodes() - 1;

        if ((n >= 0) && (n < maxnode))
            return(n+1);
        else if (n == maxnode)
            return(0);
        else
            return(-1);
    }

Given a valid node ID (n), this function returns the node ID of the successor of node n in a ring
embedded in a partition with numnodes() nodes. Otherwise, it returns -1. (The predecessor function is
similar.) A node can send a message to process type 0 on its successor node with the following
csend() call:

    csend(MSGTYPE, buf, sizeof(buf), succ(mynode()), 0);
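
For reference, here is one way the predecessor function might look next to a successor written in the
same style. So that the sketch can be compiled and tried on any system, both functions take the number
of nodes as a parameter instead of calling numnodes(); the names ring_succ() and ring_pred() are
invented for this example.

```c
#include <assert.h>

/* Successor of node n in a ring of `nodes` nodes, or -1 if n is not valid. */
int ring_succ(int n, int nodes)
{
    assert(nodes > 0);
    if (n < 0 || n >= nodes)
        return -1;
    return (n + 1) % nodes;
}

/* Predecessor of node n in a ring of `nodes` nodes, or -1 if n is not valid. */
int ring_pred(int n, int nodes)
{
    assert(nodes > 0);
    if (n < 0 || n >= nodes)
        return -1;
    return (n + nodes - 1) % nodes;   /* adding nodes keeps the value non-negative */
}
```

On the Paragon system you would pass numnodes() as the second argument, for example
ring_pred(mynode(), numnodes()).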


Example Application: Calculating pi 

This application uses an n-point quadrature rule to evaluate the following definite integral:

    $$\pi = \int_{0}^{1} \frac{4}{1 + x^{2}} \, dx$$

Admittedly, using the power of an Intel supercomputer for such a simple application is overkill, but
the application demonstrates concepts that are just as valid for more challenging problems.

Here is a sequential program (written in Fortran) that evaluates the above integral. The source for
this program is available on the Intel supercomputer in /usr/share/examples/fortran/pi/piserial.f.
Note that the user interface consists only of a read statement that solicits the number of intervals.

      program piserial
      double precision h,sum,x,pi,f,a
      integer n

c     Define the function
      f(a) = 4.0d0/(1.d0 + a*a)

c     Input the number of intervals.
 1    print *,' Enter number of intervals:'
      read(5,*,end=100) n

c     Calculate the scaling factor
      h = 1.d0/n

c     Integrate. The value of x used to calculate the slice is
c     the value at the midpoint of the integration slice.
      sum = 0.d0
      do 10 i = 1,n
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 10   continue
      pi = h * sum

c     Output the answer
      print *,' The value of pi for',n,' intervals is',pi
      goto 1
c
c     Terminate
 100  stop
      end

In the parallel version of this program, each node performs a portion of the integration. The
decomposition is a domain decomposition that “deals out” the work, as illustrated in Figure 7-2. For
example, if you choose 16 nodes and 512 points, each node gets 32 points. The first point goes to
node 0, the second point goes to node 1, and so on through the 16th point, which goes to node 15.
The 17th point goes to node 0, the 18th point goes to node 1, and so on until all the points have been
dealt out. (It is not strictly necessary to deal out the work in this way, because the integration work
is evenly balanced. However, since the data is calculated by each node, it is just as easy to deal out
as not, and this example deals out the data to give you an example of this technique.)

Figure 7-2. The Decomposition Used for the pi Example

Here is the parallel version of the program. The source for this program is available on the Intel
supercomputer in /usr/share/examples/fortran/pi/pinode.f. Differences from the serial version are
shown here in boldface.

      program pinode
      include 'fnx.h'

      double precision h,sum,x,pi,f,a,tmp
      integer n
      integer nodes, iam, intsiz
      data intsiz / 4 /

c     Define the function
      f(a) = 4.0d0/(1.d0 + a*a)

c     Do some bookkeeping
      iam = mynode()
      nodes = numnodes()

 1    if(iam .eq. 0) then
c        Input the number of intervals.
         print *,' Enter number of intervals:'
         read(5,*,end=100) n
         call csend(300,n,intsiz,-1,0)
      else
         call crecv(300,n,intsiz)
      endif

c     Calculate the scaling factor
      h = 1.d0/n

c     Integrate. The value of x used to calculate the slice is
c     the value at the midpoint of the integration slice.
      sum = 0.d0
      do 10 i = iam+1,n,nodes
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 10   continue
      pi = h * sum

      call gdsum(pi,1,tmp)

      if(iam .eq. 0) then
c        Output the answer
         print *,' The value of pi for',n,' intervals is',pi
      endif

      goto 1

c     Terminate all nodes
 100  i = kill(0, 9)
      end

Note that the parallel version is not much longer than the sequential version. Note also that the
decomposition takes place entirely in the do statement. The sequential version is:

      do 10 i = 1,n

while the parallel version is:

      do 10 i = iam+1,n,nodes

If you run the application on more nodes, you don’t have to change one line of the node program!

In the parallel version, only node 0 interacts with the user. The other nodes do only calculation. If
the print and read statements were not surrounded with if(iam .eq. 0) then ... endif statements, then
when you ran the program on 100 nodes you would have to input the number of intervals 100 times
and see the answer 100 times!
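
If you do not have a multi-node system at hand, you can still check the decomposition by simulating
it on one processor. The C sketch below is such a simulation, not a Paragon program: partial_pi()
repeats the strided loop that one node would execute, and simulated_pi() adds up the partial results
in place of the gdsum() call. Both function names are invented for this example.

```c
#include <assert.h>

/* The portion of the integral computed by node `iam` out of `nodes` nodes,
   using n intervals. The loop mirrors "do 10 i = iam+1,n,nodes". */
double partial_pi(int iam, int nodes, int n)
{
    double h = 1.0 / n;
    double sum = 0.0;

    assert(n > 0 && nodes > 0 && iam >= 0 && iam < nodes);
    for (int i = iam + 1; i <= n; i += nodes) {
        double x = h * (i - 0.5);      /* midpoint of interval i */

        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Adds the partial results of all nodes, standing in for the global sum. */
double simulated_pi(int nodes, int n)
{
    double total = 0.0;

    for (int iam = 0; iam < nodes; iam++)
        total += partial_pi(iam, nodes, n);
    return total;
}
```

Calling simulated_pi(1, 512) and simulated_pi(16, 512) gives the same value apart from floating-point
rounding, which shows that the result does not depend on the number of nodes.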


Example Application: Matrix*Vector Multiplication 

The following example computes the matrix-vector product y = Ax, where A is an n x n matrix and
x and y are vectors with n components. In addition to the numerical technique, this example
illustrates the use of the parallel file I/O calls.

The matrix A is assumed to be too large to fit in the node’s memory, requiring an “out-of-core”
multiplication. For simplicity, n, the number of rows in the matrix, is assumed to be divisible by p,
the number of nodes in the application. The number of rows per node, n/p, is referred to as m.

The problem decomposition is again a domain decomposition. Each node collects all of x, but then
takes only a portion of A (specifically m rows) to form its portion of the product vector. There is no
attempt to “deal out” the rows of A.

The vector x is initially divided among the nodes. (This example assumes that each node has
obtained its portion of x before this routine is called.) Each node contains m components of x. Node
0 has components 1 through m; node 1 has components m+1 through 2*m, etc. (In general, node i
has components i*m+1 through (i+1)*m.) The answer, the vector y, will be stored in the same way.

The matrix A, which is too large to fit in a single node’s memory, is also divided among the nodes.
It is initially stored in a file called matrix. The elements of the matrix are stored in the file by rows,
as follows:

    A(1,1), A(1,2), ... A(1,n), A(2,1), A(2,2), ... A(2,n), ... A(n,1), A(n,2), ... A(n,n)

Each row of the matrix A has n elements of length REALSIZE bytes, and so each row takes up
n*REALSIZE bytes in the file. Each node is responsible for m rows in the matrix; it reads its portion
of the matrix from the file by first moving the file pointer to mynode()*m*n*REALSIZE bytes from
the beginning of the file, then reading m rows of n*REALSIZE bytes each beginning at that point.
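
The offset arithmetic can be checked with a few lines of C. In this sketch the node number is passed
in as a parameter (on the Paragon system it would come from mynode()), and the function names are
invented for the example.

```c
#include <assert.h>

/* Byte offset in the matrix file of the first row that belongs to `node`,
   where each node owns m rows of n elements of `realsize` bytes each. */
long node_file_offset(int node, int m, int n, int realsize)
{
    assert(node >= 0 && m > 0 && n > 0 && realsize > 0);
    return (long)node * m * n * realsize;
}

/* Number of bytes each node reads: m rows of n*realsize bytes. */
long node_read_bytes(int m, int n, int realsize)
{
    assert(m > 0 && n > 0 && realsize > 0);
    return (long)m * n * realsize;
}
```

For example, with n = 8, p = 4 (so m = 2), and REALSIZE = 4, each node reads 64 bytes, and node 3
starts at byte 192. Its section ends at byte 256, which is the size of the whole file.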

Here is the code that collects x, reads the node’s portion of A, and performs the multiplication:

      subroutine matvmul(m, n, x, y, xtotal, arow)
      integer REALSIZE
      parameter(REALSIZE = 4)
      integer ncnt, fileptr, xlens(128)
      integer m, n
      real x(m), y(m), xtotal(n), arow(n)
c
c     m is n/p where n is the dimension of A
c     and p is numnodes()
c
c     Collect all of x on each node.
      do 3 i = 1, numnodes()
         xlens(i) = m*REALSIZE
 3    continue
      call gcolx(x, xlens, xtotal)
c
c     Open the file and seek to the appropriate location
      open(unit=10, file = 'matrix',
     +     form = 'unformatted')
      fileptr = lseek(10, mynode()*m*n*REALSIZE, 0)
c
c     Read the rows and use the BLAS call sdot() to do
c     the multiplication.
      do 10 i = 1, m
         call cread(10, arow, n*REALSIZE)
         y(i) = sdot(n, arow, 1, xtotal, 1)
 10   continue
      return
      end


This subroutine takes the following parameters:

m         The size of each node’s portion of the matrix A and the vector x (n/p).

n         The number of rows and columns in the entire matrix A and the number of
          elements in the entire vector x.

x         This node’s portion of the vector x (m elements).

y         This node’s portion of the result vector y (m elements).

xtotal    A temporary array used to hold the entire vector x (n elements).

arow      A temporary array used to hold one row of the matrix A (n elements).

The subroutine first calls gcolx() to collect the nodes’ portions of x together into the array xtotal. It
then opens the file containing A, moves the file pointer to the beginning of the section of the file that
belongs to this node, and then reads m rows from the file. After reading each row, it uses the BLAS
(Basic Linear Algebra Subroutines) routine sdot() to perform the dot product between the current
row and the vector x, storing the result (a scalar) into the appropriate element of the vector y.


NOTE 

You must use the -lkmath switch on the if77 command line to link
in the library that contains sdot().


See the Paragon™ Fortran System Calls Reference Manual for more information on gcolx(); see
Chapter 5 for information about parallel file I/O; see the CLASSPACK Basic Math Library User’s
Guide or CLASSPACK Basic Math Library/C User’s Guide for more information on sdot().


Example Application: The N-Queens Problem 

This application collects all the board configurations that solve the N-Queens problem. This problem
is: “Given an N x N chess board, where can you place N queens so that no queen can capture any
other?” In chess, queens attack in straight lines along the X, Y, and diagonal directions.

The N-Queens problem is typical of problems for which there is no analytical solution. Instead, there
exists a large set of candidate solutions. You test each solution and accept those that pass.

The difficulty lies in the enormous size of the candidate set. For example, an 8 x 8 chess board has
64 squares. The total number of possible positions for 8 queens can be represented as the
combination of n=64 things taken m=8 at a time. The formula for the number of combinations is:

    n! / ( m! * (n-m)! )

which evaluates to about 2^32 possibilities. Even on a state-of-the-art sequential computer, it would take
several hours to check every one of those combinations.

Even before you begin thinking about an algorithm, however, you can eliminate a large number of
possibilities. For example, any solution that has more than one queen in the same column is invalid.
This reduces the number of possibilities to 8^8 or 2^24.
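
You can verify both counts with a short C program. The sketch below evaluates the combination
formula with exact integer arithmetic and computes 8^8 directly. The function names choose() and
int_pow() are invented for this example, and neither function checks for overflow, so they are
suitable only for small arguments such as these.

```c
#include <assert.h>

/* Number of combinations of n things taken m at a time. After step i the
   running value is the combination of (n-m+i) things taken i at a time,
   so every division is exact. */
unsigned long long choose(unsigned n, unsigned m)
{
    unsigned long long result = 1;

    assert(m <= n);
    for (unsigned i = 1; i <= m; i++)
        result = result * (n - m + i) / i;
    return result;
}

/* base raised to the power exp, by repeated multiplication. */
unsigned long long int_pow(unsigned base, unsigned exp)
{
    unsigned long long result = 1;

    while (exp-- > 0)
        result *= base;
    return result;
}
```

choose(64, 8) is 4,426,165,368, slightly more than 2^32 (4,294,967,296), and int_pow(8, 8) is
16,777,216, which is exactly 2^24.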

This section shows how to use an Intel supercomputer to evaluate those 2^24 possibilities. You can
arrange the possibilities into a tree. The technique involves following a tree down until it either
reaches a dead end (an invalid state) or until it reaches a leaf (a valid solution). Figure 7-3 illustrates
such a tree. To make the figure simpler, the chess board is shown as 4 x 4. Instead of 2^24 possibilities,
you have 2^8.

The root of the tree (the zero level) is the null board (no queens present). The next level (the first
level) consists of states where a queen is in each of the positions that make up the first column. In
Figure 7-3, there are four of those. In an 8 x 8 board, there would be eight.

The next level (the second level) consists of states with two queens on the board, one in the first
column and one in the second. In Figure 7-3, there are four of those under each first level state.
Notice, however, that some states are already invalid. There is no need to follow the tree any further
down this branch. In Figure 7-3, the two leftmost states in the second level are invalid. The second
state in the first level has three dead ends in its second level.

You can see how the algorithm is going. Some paths are going to finish early because they reach 
dead ends. Others are going to take longer and reach the solutions at the leaves. This is a problem 
for control decomposition. 

Manager/worker decomposition (a type of control decomposition) is a useful way of achieving
balanced computational loads when the application consists of a large number of tasks that are of
varying length. Because there is no way of determining up front what the length of each task is, the
method consists of dividing the application into a large number of tasks (more than the number of
nodes) and then assigning tasks to individual nodes as the nodes become available.

One way of generating the tasks is for the manager node to follow the tree down until the number of
states is larger than the number of available nodes. As a further enhancement, the manager node may
even enlist the aid of the other nodes when doing this initial processing.

Then, the manager node assigns a state to a node. The node follows that state down the tree and
collects all the possible solutions. When the node finishes, it reports its solutions, if any, and requests
more work. In the case of a 4 x 4 board, the tree is shallow and there are only two solutions. An
8 x 8 board results in 92 solutions.
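
The solution counts quoted above can be reproduced with a short sequential search. The C sketch
below is not the example program shipped with the system; it is a minimal recursive search, with
invented function names, that places one queen per column and abandons a branch as soon as a queen
is attacked.

```c
#include <assert.h>

#define MAX_N 16

/* Can a queen go in (col, row) given the queens already in columns 0..col-1?
   row_of[c] is the row of the queen in column c. */
static int safe(const int *row_of, int col, int row)
{
    for (int c = 0; c < col; c++) {
        int d = row_of[c] - row;

        if (d == 0 || d == col - c || d == c - col)   /* same row or diagonal */
            return 0;
    }
    return 1;
}

/* Counts the solutions reachable from a board with columns 0..col-1 filled. */
static int place(int *row_of, int col, int n)
{
    int count = 0;

    if (col == n)
        return 1;                     /* reached a leaf: a valid solution */
    for (int row = 0; row < n; row++) {
        if (safe(row_of, col, row)) {
            row_of[col] = row;
            count += place(row_of, col + 1, n);
        }
    }
    return count;
}

/* Number of ways to place n non-attacking queens on an n x n board. */
int count_solutions(int n)
{
    int row_of[MAX_N];

    assert(n >= 1 && n <= MAX_N);
    return place(row_of, 0, n);
}
```

count_solutions(4) returns 2 and count_solutions(8) returns 92. In the parallel version, each worker
runs a search like place() starting from the partial board it has been given.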

The directory /usr/share/examples/c/nqueens contains a C version of the 8 x 8 8-Queens problem.
The example is written in C because the N-Queens algorithm makes use of recursion.

In this example, a task is represented as a partially-filled board (only the first few columns contain 
queens) given to one of the nodes. The example as described here runs on four nodes. Node 0 is the 
manager, and nodes 1 through 3 are the workers. The manager is assigned a certain number of 
columns (in this example, two) and creates partial boards by placing queens on the board, one for 
each column it is assigned. When the manager controls two columns of an 8 x 8 board, it creates 64 
partial boards. 

Figure 7-3. The N-Queens Solution Tree for a 4 x 4 Board

Also, in this example, the manager does not create the boards intelligently. For example, the
manager will create a board with two queens in the same row. If a worker gets a partial board that
contains invalid queens (such as two queens in the same row), the worker immediately throws the
board away and requests another.

The manager creates boards by counting in a radix equal to the number of rows in the board. Each 
digit in the resulting number represents a column with the least significant digit being column 0. The 
value of the digit is the row position of the queen. Hence, 00 represents two queens in row 0, and 01 
represents one queen in row 0 of column 0 and another queen in row 1 of column 1. 
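
The counting scheme can be sketched in C as follows. This is an illustration only, not the get_board()
routine from the example program; the name decode_board() and its interface are invented here. It
takes a task number, treats it as a number in radix `rows`, and reads off the digits starting with the
least significant one, which is column 0.

```c
#include <assert.h>

/* Decodes task number `task` into queen positions for the first `cols`
   columns of a board with `rows` rows: row_of[k] receives the row of the
   queen in column k. Returns 1 if `task` names a board, or 0 if it is past
   the last board (row_of is still overwritten in that case). */
int decode_board(int task, int rows, int cols, int *row_of)
{
    assert(task >= 0 && rows > 0 && cols > 0);
    for (int col = 0; col < cols; col++) {
        row_of[col] = task % rows;    /* least significant digit first */
        task /= rows;
    }
    return task == 0;                 /* leftover digits: no such board */
}
```

With 8 rows and two manager columns, task numbers 0 through 63 produce the 64 partial boards.
Task 0 puts both queens in row 0, and task 8 puts a queen in row 0 of column 0 and a queen in row 1
of column 1.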

The workers signal their availability by sending a “ready” message to the manager. This is a zero 
length message of type READY. When the manager receives a READY message, it determines who 
sent it, then sends a partial board to that node as a message of type TASK. The manager keeps doing 
this until it has no more partial tasks to assign. Finally, the manager waits until all workers are idle 
(that is, it receives a READY message from every worker) and then sends a final message with the 
special value FINISHED to all workers. 

Here are the key lines that implement the manager control. 

    /* This is the manager part */

    if (!iam) { /* If I am node 0 */
        printf("\n\n\n");
        printf("\nSTARTING ... \n");

        /* Manager keeps a count of how many workers are available
           and sends out boards to a worker when the worker identifies
           itself as READY. The manager uses the routine get_board() to get
           a new board. There are no more new boards when this routine
           returns DONE. */

        while ( get_board(board) != DONE ) {
            crecv(READY,NULL,0);
            nodenbr = infonode();
            msgcount++; /* Count how many nodes are ready */
            csend(TASK,board,sizeof(twoD),nodenbr,0);
            msgcount--; /* When a node gets a task, it is no longer
                           ready for another. Hence, decrease
                           msgcount */
        }

        /* Wait for all workers to be free (the msgcount must be equal
           to the number of worker nodes) */

        while(msgcount != nodes-1) {
            crecv(READY,NULL,0);
            msgcount++;
        }

        /* Send the FINISHED message to all nodes and then say goodbye */

        board[0][0] = FINISHED;
        csend(TASK,board,sizeof(twoD),-1,0);
        goodbye();
    }

The manager does not know if a worker has found a solution or not, and the workers do not know 
how many initial boards there are. When a worker receives a partial board, it first checks for the 
special value FINISHED, and calls goodbye() if it finds this value. (The goodbye() routine prints
a summary message in the output file, closes the file, and exits.) Next, the worker checks that the
queens already on the board are valid. If they are, the worker finds all the solutions that exist with
that partial board by recursively calling move_to_right(). When the worker finds a solution, it writes
the solution to a file called queens.out. This file was opened by all nodes in mode M_LOG (1),
which is the mode in which all nodes have a common file pointer and access the file on a first-come 
first-served basis. 

Here are the key lines that implement the worker control.

else {

    /* This is the worker part. */

    /* Each node enters an infinite loop where it receives a partial
       board and checks whether that partial board contains valid
       queens. If the board contains a FINISHED message, the node
       cleans up and exits by calling goodbye(). If the board contains
       invalid queens, the node considers itself done with the task.
       Otherwise, it tries to place a queen in the next column by calling
       move_to_right(). This routine will find all possible solutions
       given the initial board. */

    for(;;) {

        csend(READY,0,0,0,0);
        crecv(TASK,board,sizeof(board));
        if (board[0][0] == FINISHED) {
            goodbye();
        }

        if ( chk_board(board) ) {
            move_to_right(board,0,MCOLS);
        }

    }

}   /* end of else */

There are many opportunities for optimizing this algorithm. For example, you could write the 
manager in such a way that it only gave workers boards that had the potential of containing one or 
more solutions. In addition, the manager could mark positions on the board that are invalid due to 
the presence of the initial queens, and the worker would not have to check those. 

The file queens.out contains all 92 solutions for the 8-Queens problem. Each board is
preceded by a header that identifies the node that found the solution and the number of solutions 
found so far by the node. Finally, the total number of solutions is printed. The tail of the file looks 
as follows: 


Node 1 found solution 30
[8 x 8 board: column header 01234567, rows 0 through 7, one Q per row]

Node 2 found solution 31
[8 x 8 board: column header 01234567, rows 0 through 7, one Q per row]

Node 3 found solution 31
[8 x 8 board: column header 01234567, rows 0 through 7, one Q per row]

Total solutions = 92

If you want to investigate another manager/worker application, look at the triangle program in 
/usr/share/examples/c/triangle. Its operation is described in a README file.


Introduction

This chapter presents some techniques you can use to improve the performance of your parallel 
applications. It includes the following sections: 

• Single Node Performance. 

• Multi-Node Performance. 

• I/O Performance. 

In general, however, the best thing you can do to improve performance is to choose an efficient 
numerical method and algorithm for solving your problem. A good numerical method and an 
efficient algorithm will always give better performance than a poor method and algorithm. This is 
true even if the good method is implemented in a high-level language and the poor method is 
implemented in hand-coded assembly language. 

Another general performance technique is to use the Paragon system’s profiling and performance 
analysis tools to help pinpoint the parts of your application that could benefit the most from 
optimization. See the Paragon™ Application Tools User’s Guide for information on the available
tools. 


Single Node Performance

This section discusses things you can do to increase the speed of calculation (MFLOPS or GFLOPS) 
on each node. Many of these are general performance-improvement techniques that you can use on
any computer; others are specific to the i860® microprocessor. Techniques discussed in this
section include: 

• Use profiling tools.

• Avoid repeated use of system calls. 

• Avoid virtual memory paging. 

• Use compiler optimizations. 

• Increase problem size. 

• Access contiguous memory locations. 

• Use caching wisely. 

• Use optimized libraries. 

• Use assembly language subroutines. 

• Avoid error checking (C language only). 


Use Profiling Tools 

The Paragon system comes with the prof and gprof profilers, and their graphical versions xprof and 
xgprof. You can use these tools to help track down the parts of your application that are consuming 
the most time and concentrate your optimization efforts there. See the Paragon™ Application Tools
User’s Guide for more information on these tools. 


Avoid Repeated Use of System Calls 

Don’t make a system call twice if once will do. This is an obvious performance improvement 
technique, but unfortunately it is missing from many applications. For example, a process may need
its node number and process type to do message passing. Avoid using mynode() and myptype()
each time you need those numbers. Instead, invoke each once and store their values in variables. 
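
The pattern can be sketched as follows. So that the sketch runs anywhere, it does not call mynode() itself; query_node_number() is a hypothetical stand-in for an expensive system call that counts how often it is invoked, and the other names are likewise invented for this illustration.

```c
/* Hypothetical stand-in for an expensive system call such as
   mynode(); it counts how often it is invoked. */
static int call_count = 0;

static long query_node_number(void)
{
    call_count++;
    return 3;                  /* pretend we are node 3 */
}

/* Adds the node number to a total n times, querying the node
   number once, outside the loop. */
long sum_with_cached_id(int n)
{
    long iam = query_node_number();   /* one call, result saved */
    long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += iam;          /* reuse the saved value */
    return total;
}

/* Reports how many times the stand-in call was made. */
int calls_made(void)
{
    return call_count;
}
```

However many iterations the loop runs, the stand-in call is made only once per invocation of sum_with_cached_id().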


Avoid Virtual Memory Paging

The Paragon OSF/1 operating system provides virtual memory, which lets you use more memory 
than is physically available on the node. When a program tries to allocate a memory space that is 
larger than the node’s available free memory, one or more 8K-byte virtual memory pages that 
haven’t been referenced recently are paged out. This means that their contents are written to disk and 
replaced with the new data. Later, when the program references data in the paged-out memory 
section, a different section is paged out and the old data is paged in (read back from disk) in its place. 

Although virtual memory makes it possible for the system to support multiple users and very large 
programs, you should try to avoid paging when you can. Accessing pages of virtual memory that are not
currently paged in is much slower than accessing physical memory and generates a lot of disk activity. Try to
reduce the memory used by your application until it fits in physical memory, including 
dynamically-allocated buffers and system message buffers (see “Understand Message-Passing Flow 
Control” on page 8-13 for information on the sizes of system message buffers). 

You can use the vmstat command to get information about your application’s memory usage. See 
the Paragon™ Commands Reference Manual for information on this command.

Once you have reduced your application so that it fits in physical memory, you may be able to use 
the -plk switch to lock parts of your application into physical memory. This reduces paging and 
improves message-passing latency, but has certain consequences; see “Process Locking” on page 
8-15 for more information. 


Use Compiler Optimizations 

When you compile a program, you can use compiler optimization switches to tell the compiler what 
techniques to use to optimize your code. Optimization can produce a compiled program that does 
the same work in less time by making better use of the processor’s special features. However, 
optimization can sometimes produce a program that runs more slowly or produces wrong answers, 
so it must be used carefully. 

The compiler optimization switches you can use and compiler-specific code changes you can make
depend on which language you program in and the revision level of the compiler. See the Paragon™
Fortran Compiler User’s Guide or Paragon™ C Compiler User’s Guide for complete information
on compiler optimizations for your specific compiler. However, here are a few general hints:

• Experiment with the -O switch, which controls the level of compiler optimization: 

Level 0 performs no optimization.

Levels 1 and 2 perform straightforward optimizations that should always result in 
improvement. 

Levels 3 and 4 attempt to make use of the i860® microprocessor’s pipelining and 
dual-instruction modes to improve performance; whether or not they improve your 
program’s performance, and by how much, depends on the characteristics of your program. 

In some cases, different parts of the program should be compiled with different optimization 
levels. 

• Try the -Mvect switch to invoke the vectorizer. The vectorizer attempts to rearrange your code 
to allow more efficient use of pipelining. You can get better results out of the vectorizer if your 
innermost loops have the following characteristics: 

The loop index increments the first dimension in Fortran, or the last dimension in C. 

Arrays are accessed with unit stride. 

The number of iterations within the loop is not too small. 

Also, avoid IF tests and subroutine calls within the loop. See the Paragon™ Fortran
Compiler User’s Guide or Paragon™ C Compiler User’s Guide for specific examples of code
changes you can make.

• If your application does not depend on strict IEEE semantics for mathematical operations, try 
the -Knoieee switch. This switch provides much faster mathematical operations than those 
provided by the default IEEE math library, but may result in slightly decreased accuracy and 
different behavior in exceptional circumstances (operations on 0 or infinity and NaNs). 

• Use the -Mnostride0 switch, unless your program accesses arrays with zero stride (that is,
incrementing the array pointer by 0 in each loop iteration). There are some important compiler 
optimizations that are only possible if the compiler knows the code does not do this. 


Increase Problem Size

Once you have optimized a program’s single-node performance, you may find that running the 
program on a larger number of nodes with the same data set gives a lower per-node performance. 
This can occur because the per-node vector size has gone down, reducing the efficiency of the i860 
microprocessor’s pipelines. To avoid this problem, you can increase the problem size as you increase 
the number of nodes, or you could even write two different inner loops—one optimized for short 
vectors, the other optimized for longer vectors. 


Access Contiguous Memory Locations 

Whenever you access memory, try to access contiguous memory locations. In particular, whenever 
your program reads or changes the value of an array element in memory, try to be sure that the next 
array element it reads or changes is adjacent to the previous one in memory. This is important 
because the i860 microprocessor accesses memory in 4K-byte physical memory pages. Once you 
have read from a physical page, another read from the same page takes only one more cycle, but a 
read from a different page takes 10 to 14 more cycles. Every cycle spent switching from one page to 
another is a cycle that can’t be used for calculation. 

To keep your memory accesses within a physical page, you can use some of the following 
techniques: 

• Group a series of memory reads out of the same array. 

• Do consecutive references across the rows (C) or down the columns (Fortran) of matrices. (In 
C the rightmost index of an array varies the fastest, while in Fortran the leftmost index varies 
fastest. This means that, for example, if you distribute the elements of a two-dimensional array 
among the nodes, you should give out rows in C and columns in Fortran.) 

• “Strip-mine” loops, so that you do several accesses to the same array at a time. For example, 
you should read several elements from vector A, then several from B, instead of reading A[1],
B[1], A[2], B[2], and so on.

You should also try to group reads and writes. Once you have read from a page, a following write 
takes about 6 more cycles than a following read. Switching from write to read also takes about 6 
cycles. 
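
The access patterns described above can be sketched in C as follows. This is an illustration only: the function names and the sizes N and STRIP are assumptions, and whether the strip-mined form is actually faster with a given compiler and optimization level is something to measure rather than assume.

```c
#define N 64          /* vector length (assumed for the sketch) */
#define STRIP 8       /* elements read from one array at a time */

/* Dot product that reads STRIP elements of a, then STRIP elements
   of b, instead of alternating a[0], b[0], a[1], b[1], ... */
double dot_strip_mined(const double a[N], const double b[N])
{
    double ta[STRIP], tb[STRIP];
    double sum = 0.0;
    int i, j;

    for (i = 0; i < N; i += STRIP) {
        for (j = 0; j < STRIP; j++)   /* group of reads from a */
            ta[j] = a[i + j];
        for (j = 0; j < STRIP; j++)   /* group of reads from b */
            tb[j] = b[i + j];
        for (j = 0; j < STRIP; j++)
            sum += ta[j] * tb[j];
    }
    return sum;
}

/* Row-wise sum of a C matrix: the rightmost index varies fastest,
   so consecutive accesses touch adjacent memory locations. */
double sum_rows(double m[N][N])
{
    double sum = 0.0;
    int r, c;

    for (r = 0; r < N; r++)
        for (c = 0; c < N; c++)
            sum += m[r][c];
    return sum;
}
```

The sketch assumes that N is a multiple of STRIP; a general version would need a cleanup loop for the leftover elements.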


Use Caching Wisely 

The i860 microprocessor has a 16K-byte data cache for recently-accessed memory locations. 
Whenever you read or change a bit in memory, a 32-byte area of memory containing that bit is 
copied into the cache. (This 32-byte area is called a cache line and always begins on a 32-byte 
boundary.) When you access memory that is already in the cache, the access is very fast. However, 
whenever a new cache line is copied into the cache, cache lines that have not been accessed recently
are written back to memory (if necessary) and removed from the cache to make room. Try to arrange 
your code so that all operands for a loop can be accommodated in the cache at the same time. 

The i860 microprocessor also has an instruction cache (4K bytes on the i860 XR, 16K bytes on the 
i860 XP) which is used to hold program instructions once they have been fetched and decoded from 
memory. You can use this cache in two ways: 

• Try to keep your loops small. If the code for an entire loop fits in the instruction cache, the loop 
can execute very quickly. 

• In an if/else block, try to put the code that is used more often in the if part and the code that is 
used less often in the else part. The instruction cache works in a “lookahead mode,” and when 
pre-fetching instructions will fetch the code immediately following the if. If the else branch is 
executed instead, the if branch code must be discarded from the cache. 

Both these techniques can be used in high-level languages as well as assembly code. 

Note that the data cache, the instruction cache, the physical memory page, and the virtual memory 
page are separate functions that have different sizes and different effects. In general, cache 
management is handled by the compiler, but you should try to arrange your code to make the 
compiler’s job easier. 


Use Optimized Libraries 

Several optimized libraries of math and utility functions are available with Paragon OSF/1. These 
libraries have been carefully hand-tuned to give the best possible performance; you can save time 
and increase efficiency by using routines from these libraries rather than writing the equivalent code 
yourself. The available libraries include: 

• The Basic Math Library (libkmath.a). This library is a standard part of Paragon OSF/1; it
includes optimized BLAS (Basic Linear Algebra Subroutines) and FFT (Fast Fourier
Transform) routines. See the CLASSPACK Basic Math Library User’s Guide or CLASSPACK
Basic Math Library/C User’s Guide for more information.

• The Signal Processing Library (libsigml.a). This library is an optional product; it includes
optimized vector and signal-processing routines. See the CLASSPACK Signal Processing
Library User’s Guide or CLASSPACK Signal Processing Library/C User’s Guide for more
information.

Note that these are single-node libraries; they improve the numeric performance of each node of
your program, but do not affect its multi-node performance. 



Paragon™ User's Guide 


Improving Performance 


Use Assembly Language Subroutines 

Re-writing key routines in the i860 microprocessor’s assembly language can sometimes bring 
significant performance benefits. See the Paragon™ i860® 64-Bit Microprocessor Assembler
Reference Manual for information on the assembler. 


Avoid Error Checking (C Language Only) 

In C, there are two versions of most calls in libnx.a: the standard version and the underscore version
(for example, the underscore version of crecv() is _crecv()). When you call the standard version, the
call checks for certain error conditions before it returns; if an error is detected, the call terminates
your program with an error message. The underscore version works the same as the non-underscore
version, but if an error occurs, the call simply returns the value -1 and sets the external variable errno
to a value that describes the error. This is useful if you want to handle an error yourself and not let
the system do it. But if you are confident that your program is working, you may choose to use the
underscore version and not check the return value, thereby improving performance. (If an error does
occur, unexpected and difficult-to-debug behavior will result, so use this technique with caution.)


Multi-Node Performance 

This section discusses things you can do to increase the efficiency of applications running on 
multiple nodes, including: 

• Use dynamic memory allocation for large arrays. 

• Avoid serializing calls. 

• Use ParaGraph. 

• Maintain data locality. 

• Overlap computation and communication. 

• Avoid message buffering. 

• Align application buffers. 

• Understand message-passing flow control. 


Use Dynamic Memory Allocation for Large Arrays

You should always use dynamic memory allocation for large arrays. Dynamic memory allocation 
means allocating the memory for the array at run time, using the ALLOCATE statement in Fortran 
or the malloc() call in C. The alternative, static memory allocation, means declaring the array in the
program source. 

Dynamic allocation is important even on one node, but becomes more and more important as the 
number of nodes increases. The larger the array and the more nodes, the more performance can be 
improved by using dynamic memory allocation. If the array or number of nodes is large enough, the 
application may not run at all unless you use dynamic memory allocation. 

For example, the following program fragment uses static memory allocation. It simply creates two 
4M-byte arrays of real*4 (one in a common block, the other not) and initializes each element of each 
array to the element number. 

      parameter huge_size = 1024*1024
      real*4 huge(huge_size), huge_common(huge_size)
      common /giant/ huge_common
      integer i

      do 10 i=1,huge_size
         huge(i) = i
         huge_common(i) = i
 10   continue

The equivalent code with dynamic memory allocation is as follows (changes are shown in boldface). 
With 16M bytes of memory on each node, this version runs as much as ten times as fast as the 
previous version on one node; it runs as much as fifteen times as fast on eight nodes. The more nodes, 
the greater the speedup. 

      parameter huge_size = 1024*1024
      real*4 huge(huge_size), huge_common(huge_size)
      pointer(p, huge)
      common, allocatable /giant/ huge_common
      integer i

      allocate(huge, /giant/)

      do 10 i=1,huge_size
         huge(i) = i
         huge_common(i) = i
 10   continue

      deallocate(huge, /giant/)

Note that a common block must be declared ALLOCATABLE before it is allocated with an
ALLOCATE statement. A variable or array that is not part of a common block must be declared as
a pointer-based variable with a POINTER statement before it is allocated with ALLOCATE; the
corresponding pointer variable, in this case p, does not have to be used. See the Paragon™ Fortran
Language Reference Manual for more information on these statements.

The reason statically allocated arrays cause your program to run slowly is that, since they are 
compiled into the program, the initial contents of the array must be obtained from the executable 
program on disk. Whenever a process on a compute node reads or changes a byte in a memory page 
of statically allocated data that it hasn’t touched before, the data for that page may have to be paged 
in. (See “Avoid Virtual Memory Paging” on page 8-3 for an introduction to virtual memory paging.) 
A message requesting the initial contents of that page is sent to the node in the service partition 
where the compute process’s parent process is running. The entire page—8K bytes—is then sent 
back to the compute process. This occurs even if the statically allocated data is uninitialized (all 
zeroes). 

Sending these pages across the mesh takes time. Even worse, if many compute processes all want 
pages at the same time, the parent process’s node can become overwhelmed, slowing the application 
drastically. The effect is magnified if the statically-allocated array is so large that parts of the 
operating system have to be paged out to make room for it; in this case, pages of the operating system 
have to go out at the same time pages of static data are coming in. 

When you use ALLOCATE or malloc() to dynamically allocate an array, the memory is not
associated with the program on disk. Instead, each node has its own copy of the array, and it doesn’t 
have to be paged in. When a compute process touches a page of dynamically-allocated data it hasn’t 
touched before, the page is simply allocated from the available node memory—no messages are sent. 
This greatly reduces traffic on the mesh and increases the performance of your application. 
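
In C, the dynamically allocated counterpart of the Fortran fragment might look like the following minimal sketch. The names make_huge and HUGE_SIZE are assumptions made for this illustration, and the 4M-byte figure in the comment assumes a 4-byte float.

```c
#include <stdlib.h>

#define HUGE_SIZE (1024 * 1024)   /* 1M elements; 4M bytes if float is 4 bytes */

/* Allocates the array at run time, initializes each element to its
   element number (1-based, as in the Fortran fragment), and returns
   it. Returns NULL if the allocation fails. The caller frees it. */
float *make_huge(void)
{
    float *huge = malloc((size_t)HUGE_SIZE * sizeof *huge);
    size_t i;

    if (huge == NULL)
        return NULL;
    for (i = 0; i < (size_t)HUGE_SIZE; i++)
        huge[i] = (float)(i + 1);
    return huge;
}
```

Because the array is obtained from malloc() rather than declared at file scope, it is not part of the executable's static data.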


Avoid Serializing Calls 

Avoid using serializing calls repeatedly or on many nodes. A serializing call is one that relies on a 
single resource which can only service one request at a time (typically a daemon or server on the 
boot node). Using a serializing call once takes little time, but if many nodes in a large application 
call it at the same time the boot node can only service these requests one at a time. Each node must 
wait until the boot node services its request, which can cause the entire application to run slowly. 

Many calls that perform I/O or make use of the file system, such as stat(), chdir(), and chmod(), are
serializing calls, because they must communicate with the root file system server on the boot node.
getrusage() is also a serializing call, because it sends a message to all the I/O nodes to get
information on the caller’s I/O activity. You can detect the presence of serializing calls in your 
program by profiling it. If common operations, especially I/O operations, are taking much more time 
than expected, they may be serializing calls. 

Whenever possible, avoid overuse of serializing calls by having only one node make the call. For
example, instead of having every node process call stat(), have one node call stat() and then use
gisum() to distribute the information to the other nodes. Also, instead of having every node process
call chdir() when it starts up, have the controlling process call chdir() before creating the node
processes. You can avoid serialization in I/O by using the I/O mode M_RECORD or M_GLOBAL,
as described under “Use the Appropriate I/O Mode” on page 8-24.


Use ParaGraph 

The Paragon system comes with the ParaGraph performance visualization tool. You can use this 
graphical tool to help analyze your application’s message passing behavior and determine where to 
concentrate your optimization efforts. See the Paragon™ Application Tools User’s Guide for more
information on ParaGraph. 


Maintain Data Locality 

Wherever possible, try to distribute the data to the nodes so that each node has all the data it needs, 
and does not have to get any data from other nodes. Where it is not possible to keep all related data 
on one node, try to keep the data as close as possible to the nodes that need it. 

For example, suppose you are writing a simulation where the value of each data point in a plane is 
computed from the values of nearby data points. To parallelize this simulation, you would divide up 
the plane into segments and assign each segment to a node. However, each node must communicate 
with other nodes to get data for points that are just past the edge of its data segment. Since the 
Paragon system has a mesh architecture, you would typically divide up the plane in a 
two-dimensional decomposition (rectangles). You would then assign the rectangles to nodes in such 
a way that neighboring segments are the responsibility of neighboring nodes. (Use the 
nx_app_rect() call to determine the “shape” of your application.) This will reduce message traffic 
and ensure that each message reaches its destination as quickly as possible. 
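
The neighbor assignment can be sketched as follows. The sketch assumes that the application occupies a rectangle of rows by cols nodes and that node numbers run row by row, left to right, starting at 0. That numbering, the enum, and the function name neighbor are assumptions made for this illustration. On a Paragon system the rectangle's dimensions would come from nx_app_rect(), which is not called here.

```c
/* Direction of a neighbor in the node rectangle. */
enum dir { NORTH, SOUTH, WEST, EAST };

/* Returns the node number of the neighbor of `node` in direction d,
   or -1 if `node` is on that edge of the rectangle. Assumes nodes
   are numbered row by row, left to right, starting at 0. */
long neighbor(long node, long rows, long cols, enum dir d)
{
    long r = node / cols;      /* row of this node */
    long c = node % cols;      /* column of this node */

    switch (d) {
    case NORTH: return (r > 0)        ? node - cols : -1;
    case SOUTH: return (r < rows - 1) ? node + cols : -1;
    case WEST:  return (c > 0)        ? node - 1    : -1;
    case EAST:  return (c < cols - 1) ? node + 1    : -1;
    }
    return -1;                 /* not reached for valid directions */
}
```

Each node would then exchange the edge points of its rectangle only with the nodes this function returns, skipping any direction that yields -1.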


Overlap Computation and Communication 

The Paragon OSF/1 message-passing calls are available in both synchronous versions (call names 
beginning with c) and asynchronous versions (call names beginning with i). Synchronous calls do 
not return until the message-passing operation is complete; asynchronous calls return immediately, 
giving you a message ID that you can use to check when the operation is complete. 

Although the synchronous calls are easier to use and have slightly lower overhead, you should use 
the asynchronous calls whenever the results of the call are not needed immediately. Using 
asynchronous calls can let your application do useful computation in the time when it would 
otherwise just be waiting for a message to arrive. During this time, the node’s message coprocessor 
can process the communication without interrupting the main processor. 


Avoid Message Buffering

Try to avoid message buffering whenever possible. For example, assume that you have the same 
process running on two nodes and that these processes must exchange information. Each process 
must issue a receive and a send. If a csend() call is executed before its corresponding crecv(), the
message is sent and buffered in a system buffer (if there is not enough space in system buffers, the
sender blocks). When the crecv() is executed, the message is copied from the system buffer into the
application buffer. (For detailed information on message buffering, see “Understand
Message-Passing Flow Control” on page 8-13.)

Your code runs more efficiently if you can avoid the system buffer and copy the message directly 
into the application buffer. You can do this by using an irecv() (the asynchronous receive) and 
posting the receive before the corresponding csend(). Remember, however, that because the nodes
do not run in lock step, coding the irecv() before its corresponding csend() does not guarantee that
the irecv() is executed before its csend() (even if the same program runs on every node). You can
make sure that the irecv() is executed before the csend() by using zero-length messages to
synchronize the nodes, as shown in the following example.

Consider the following C routine, shadow. This routine might appear in an application
that needs to have the nodes exchange the rows of a matrix (a Fortran version would probably 
exchange columns instead). 

A typical application for shadow might be a Gauss-Seidel iteration or any technique based on
nearest-neighbor interactions. The application processes a two-dimensional array called r[][] and exchanges
rows between nodes. The first index in the array represents a row, and it is passed as a pointer. Each
node contains a horizontal partition of the array with range rows. It has a top buffer (s[0]) and a
bottom buffer (s[range+1]) containing the boundary rows from other nodes.

void shadow(
    long topnode,
    long botnode,
    int (*s)[MAX_LATTICE],
    int range)
{
    long topid, botid, syncidtop, syncidbot;

    /* Node sends upper boundary row s[1] to the bottom buffer
       s[range+1] of the node controlling the upper partition
       (topnode).

       Node sends lower boundary row s[range] to the top buffer s[0]
       of the node controlling the lower partition, botnode */

    topid = irecv(TOP, s[0], sizeof(s[0]));
    botid = irecv(BOT, s[range+1], sizeof(s[range+1]));

    /* The following code ensures that the csend()s corresponding
       to the above irecv()s are not executed until all the irecv()s
       have been posted. */

    syncidtop = irecv(SYNCTOP, 0, 0);
    syncidbot = irecv(SYNCBOT, 0, 0);
    csend(SYNCTOP, 0, 0, topnode, 0);
    csend(SYNCBOT, 0, 0, botnode, 0);
    msgwait(syncidtop);
    msgwait(syncidbot);

    /* End of synchronization code. */

    csend(BOT, s[1], sizeof(s[1]), topnode, 0);
    csend(TOP, s[range], sizeof(s[range]), botnode, 0);

    msgwait(topid);
    msgwait(botid);
}

Note that when the data sends are performed the asynchronous receives have already been executed. 
This is ensured by the zero-length synchronization messages. The data then goes directly into the 
application buffer (unless it is paged out, as discussed later in this section). 

Another way of achieving synchronization is to issue a gsync() after the irecv() calls. However, gsync()
can be expensive—it synchronizes all the nodes, when all that’s really necessary is to synchronize 
senders and receivers. A good rule of thumb is to synchronize only what is really necessary. 


Align Application Buffers 

Try to ensure that send and receive buffers are properly aligned and sized whenever possible. 
Although the message-passing system calls will work with any size or alignment of buffers, the 
hardware works best with well-aligned buffers. The software may have to copy messages that are in 
misaligned buffers to new, aligned buffers, which decreases performance. There are several degrees 
of alignment. All other things being equal: 

1. The best performance can be achieved by aligning the send or receive buffer on a 4K-byte 
boundary (this means that the buffer’s address is an even multiple of 4K). This corresponds with 
the i860 microprocessor’s 4K-byte physical memory page. 

2. Good performance (only slightly worse) occurs if the buffer is aligned on a 32-byte boundary, 
which corresponds with the microprocessor’s cache line, but crosses a 4K-byte memory page 
boundary. 

3. The next-best performance (not nearly as good) occurs if the buffer is aligned on an 8-byte 
boundary, which corresponds with the microprocessor’s FIFO size. 

4. Some performance improvement can be seen if the buffer is aligned on a 4-byte boundary.

5. The worst performance comes when the buffer is not even aligned on a 4-byte boundary (that 
is, its address is not a multiple of 4). 

To be sure the buffer is well-aligned, you should use malloc() to allocate it rather than allocating it
statically. (This call is available in both Fortran and C.) Buffers allocated with the standard malloc()
or its derivatives will always be aligned on a 32-byte boundary.

You can arrange for the pointer to a message buffer to be on a 4K-byte boundary by using pointer 
arithmetic. This technique can be used even if the buffer is statically allocated. You do this by 
declaring your buffer to be 4095 (4K-1) bytes longer than needed, adding 4095 to the buffer pointer, 
and then ANDing the pointer with the NOT of 4095. For example, assume that you need a 50K-byte 
buffer called buf. You can cause the buffer pointer bufp to be on a 4K-byte boundary by declaring 
buf to be 54K-1 bytes and doing the following pointer arithmetic: 

char *bufp, buf[55295]; 


/* bufp points to the first 4K-byte boundary in the buffer */
bufp = (char *) (((int)buf + 4095) & ~4095);
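
The same round-up can be written as a small helper. This sketch uses the C99 type uintptr_t instead of a cast to int, so that it also works where pointers are wider than int. That choice, and the names align_4k and PAGE, are assumptions made for this illustration and are not taken from the Paragon compilers.

```c
#include <stdint.h>

#define PAGE 4096u    /* 4K-byte boundary */

/* Rounds p up to the next multiple of PAGE (or returns p unchanged
   if it is already aligned). The caller must have allocated at
   least PAGE - 1 extra bytes beyond the size it needs. */
char *align_4k(char *p)
{
    uintptr_t a = (uintptr_t)p;

    a = (a + (PAGE - 1)) & ~(uintptr_t)(PAGE - 1);
    return (char *)a;
}
```

For a 50K-byte message buffer, the underlying array would be declared with 50K + 4095 bytes and the pointer returned by align_4k() used in its place.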

Finally, when you set up C structures that you intend to use as messages, try to minimize padding 
(open areas within the structure inserted by the compiler to make the following structure element 
properly aligned). To do this, you should be aware of the sizes and alignments of the different data 
types; in general, you can minimize padding by placing the larger data types first. Reducing padding 
reduces the number of empty bytes sent with the message. 
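
The effect of member order can be seen in a sketch like the following. The structure and member names are invented for this illustration, and the exact sizes depend on the compiler and target, so none are stated here.

```c
#include <stddef.h>

/* Small members placed between large ones: the compiler typically
   pads after `tag` and after `flag` to align the next member. */
struct msg_padded {
    char   tag;
    double value;
    char   flag;
    int    count;
};

/* Same members, largest first: typically less padding. */
struct msg_ordered {
    double value;
    int    count;
    char   tag;
    char   flag;
};
```

Comparing sizeof(struct msg_padded) with sizeof(struct msg_ordered) on your own system shows how many bytes the reordering saves in each message.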


Understand Message-Passing Flow Control 

Whenever you send a message, the sending process and the receiving process use message-passing 
flow control to make sure that the message is safely stored in memory when it arrives at the 
destination node. This flow control guarantees that messages flow from their source to destination 
without blocking on the mesh. If you understand flow control, you can select the message sizes, 
message-passing configuration parameters, and message synchronization techniques that give the 
best possible message-passing performance for your application. 

Note, though, that the most important thing you can do to improve message-passing flow control is 
to avoid message buffering (as discussed under “Avoid Message Buffering” on page 8-11). You 
should only read this section if you cannot avoid message buffering and need to improve its 
efficiency, or if you really want to understand all the nitty-gritty details of message-passing flow 
control. If neither of these applies to you, skip to “Recommendations” on page 8-21. 


Overview of Message-Passing Flow Control 

Here’s an overview of what happens when you send a message. (This is a simplified view; the 
low-level details of message-passing flow control are proprietary and subject to change.) 

1. The sending process checks to see if the memory page containing the message to be sent is 
currently in physical memory. If not, it is paged in. (See “Avoid Virtual Memory Paging” on 
page 8-3 for information on paging.) 

2. The sender checks to see how much memory it thinks is available in system message buffers on 
the receiver and sends the appropriate number of bytes (see “System Message Buffers” on page 
8-16 for details). If the whole message has been sent at this point, the send is complete; 
otherwise, it waits for a request from the receiver for the next part of the message. This waiting 
may or may not block the sending process, depending on whether the sending call was 
synchronous or asynchronous. 

3. When the message (or the first part of the message) arrives at the receiving node, the node’s 
operating system checks to see if there is a receive posted for a message of that type by the 
specified receiving process. (“Posted” means that the process has an outstanding 
message-receiving call that has not yet been fulfilled.) 

A. If there is a receive posted and the application buffer (the buffer specified in the receive 
call) is currently paged in, the message is stored directly into the application buffer. 

B. If there is a receive posted and the specified application buffer is not paged in, the message 
is stored in a system buffer. Then the application buffer is paged in and the message is 
copied into it. 

C. If there is no receive posted, the message is stored in a system buffer and the receiver waits 
until a receive is posted. When the receive is posted, the specified application buffer is 
paged in (if necessary) and the message is copied into it. 

4. If the whole message has been received at this point, the receive is complete; otherwise, it sends 
a request to the sender for the next part of the message and waits. (This waiting may or may not 
block the receiving process, depending on whether the receiving call was synchronous or 
asynchronous.) The request also includes the current free space in system message buffers on 
the receiver, which is used to calculate how big the next part should be. Go back to step 1 and 
continue until the message has been completely sent and completely received. 

Special case: if the sender thinks the message is too large to send all at once, and there is no receive 
posted, but there is actually enough space in system buffers on the receiver to accommodate the 
entire message, the receiver stores the first part of the message in a system buffer and immediately 
sends a special request to the sender saying “send the whole rest of the message.” In this case, the 
entire message can be sent before the receive is posted. Otherwise, only the first part of the message 
is sent and the rest of the message waits on the sender until the receive is posted. 


Process Locking 

The message-passing flow control procedures check to make sure that application buffers are paged 
into physical memory, and copy information from one buffer to another if they are not. You can 
avoid these steps by using the -plk switch on the application command line. This switch locks parts 
of each process into physical memory, like the OSF/1 system call plock() (see the OSF/1 
Programmer’s Reference for information on plock()). This locking is also referred to as wiring. 

The -plk switch locks the following parts of your application into physical memory: 

• The entire data segment (the part of memory that contains global variables) is locked. This 
occurs when the program is loaded. 

• If you use an application buffer that is located on the stack (the part of memory that contains 
local variables) or on the heap (the part of memory that is allocated by malloc() or 
ALLOCATE), the area from the beginning of the stack/heap to the end of the buffer is locked. 
This occurs the first time you use the buffer in a message-sending or -receiving call. 

All areas of memory not mentioned in this list, including the code segment (the part of memory that 
contains executable instructions), are not locked and are still subject to paging. Note that locking is 
done a page at a time: to lock a single byte, the system must lock the entire 8K-byte virtual memory 
page containing that byte. 
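
Because locking works on whole pages, a small buffer can pin more memory than its size suggests. The sketch below counts the pages touched by a byte range; it is only an illustration, and VM_PAGE and pages_spanned are hypothetical names, not system calls.

```c
#include <stddef.h>
#include <stdint.h>

#define VM_PAGE 8192u   /* 8K-byte virtual memory page, as stated above */

/* Number of whole pages touched by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr / VM_PAGE;
    last = (addr + len - 1) / VM_PAGE;
    return (size_t)(last - first + 1);
}
```

For example, a 2-byte range that straddles a page boundary touches two pages, so locking it would lock 16K bytes.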

The -plk switch greatly reduces the effect of virtual memory on your application and improves 
message-passing latency. However, it has the following consequences: 

• Your application must fit in physical memory. If it does not, any operation that results in the 
allocation or locking of more memory may fail unexpectedly, possibly terminating the 
application. 

Ideally, the operating system and the application’s code and data should all fit into the node’s 
physical memory at once. However, the code segment is subject to paging even when -plk is in 
effect, so the application may still work if there is enough physical memory left over after 
subtracting the size of the operating system and the total amount of locked data. The definition 
of “enough” depends on the application’s pattern of access to its code in memory and how much 
of the code needs to be present in memory at once. (See the Paragon™ System Software Release 
Notes for the Paragon™ XP/S System for information on how much memory is needed by the 
operating system.) 

• The physical memory available for other processes is reduced by the size of your application’s 
locked data for the life of your application. 

When -plk is in effect, none of the locked data will be removed from physical memory until the 
application terminates. Even if your application is not actually executing (for example, because 
it is “rolled out” by gang scheduling), it still retains control of this memory, and the application 
that is currently executing cannot use the memory that is locked by your application. This can 
cause the other application to run very slowly or “thrash.” 

To prevent this “thrashing,” your system administrator can configure the system so that -plk cannot 
be used in gang-scheduled or standard-scheduled partitions. If -plk is allowed at your site, it should 
be used with extreme care because of its impact on other users. 

-plk also conditions message-passing flow control to run more efficiently by assuming that all 
message buffers are locked into memory. For example, suppose a message is too large to send all at 
once. With -plk, after the first part of the message arrives, the receiver can request the entire rest of 
the message—no matter how big it is—as soon as the receive is posted. Since the application buffer 
cannot be paged out, the receiver can be sure it will be there to receive the rest of the message when 
it arrives. Without -plk, the application buffer could be paged out while the second part is on its way, 
so the second and subsequent parts of the message must be smaller than the available system 
message buffer space. This means that many more exchanges might be required before the message 
is completely received. 


Packetization 

Messages from one node to another may be broken into smaller messages, called packets, before 
they are placed on the mesh. Using packets results in a slight additional overhead on large messages, 
but it gives much better overall message bandwidth because it allows several large messages to be 
interleaved on the same wire at the same time. 

The maximum size of each packet is referred to as packet_size, and is 1792 bytes by default (this 
size does not include the header appended to each packet). If a message is larger than packet_size, it 
is sent in several pieces, each at most packet_size bytes long. You can change the packet size with 
the -pkt switch on the application command line. 
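
The arithmetic implied by this description can be sketched as follows. This is an illustration only: the function names are hypothetical, packet headers are not counted, and a zero-length message is reported as zero packets, which ignores any header-only packet the system may actually send.

```c
#include <stddef.h>

#define DEFAULT_PACKET_SIZE 1792u   /* default packet_size, in bytes */

/* Number of packets needed for a message of msg_len bytes when each
   packet carries at most packet_size bytes of data. */
static size_t packet_count(size_t msg_len, size_t packet_size)
{
    return (msg_len + packet_size - 1) / packet_size;   /* ceiling division */
}

/* Data bytes in the last packet; equals packet_size when msg_len is a
   nonzero multiple of packet_size. */
static size_t last_packet_len(size_t msg_len, size_t packet_size)
{
    size_t rem = msg_len % packet_size;
    return (rem == 0 && msg_len > 0) ? packet_size : rem;
}
```

With the default packet size, a 10000-byte message needs six packets: five full packets of 1792 bytes and a final packet of 1040 bytes.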


System Message Buffers 

In each node process, an area of memory is set aside for system message buffers. These buffers are 
used to store messages that arrive at the node before the receiving process is ready to receive them. 
For example, if a sending process calls csend() before the receiving process has called the 
corresponding crecv(), the message goes into a system buffer in the receiving process. Then, when 
the receiving process does call crecv(), the message is copied from the system buffer to the buffer 
specified in the crecv() call, which is referred to as the application message buffer. 

The size and behavior of the system message buffers are controlled by several parameters that you 
can set on the application command line. The following list describes these parameters and their 
effects. 

• The total amount of memory allocated to system message buffers in each process is referred to 
as message_buffer. The message_buffer is always wired into physical memory, which means 
that it can never be paged out (see “Avoid Virtual Memory Paging” on page 8-3 for information 
on paging). This is necessary to ensure that all messages that arrive at the node can be stored 
somewhere, even if the rest of the application is paged out. The default value of message_buffer 
is 1152K bytes. You can change this size with the -mbf switch on the application command line. 

• The message_buffer is divided into an area for messages from any process, and a series of areas 
dedicated to messages from particular processes. The number of dedicated areas is referred to 
as correspondents, and the size of each dedicated area is referred to as memory_each. When a 
message is received, it is stored in the open area if there is room; the memory_each areas are 
used only when the open area is full. The default value for correspondents is numnodes(); you 
can change it with the -noc switch. The default value of memory_each is determined by the 
current values of correspondents, message_buffer, and packet_size; you can change it with the 
-mea switch. 

• Each node process maintains a value called the send_avail for each other process in the 
application. The send_avail is the maximum amount of memory that the process can depend on 
to be available in its memory_each segment in that other process. (The send_avail may be 
smaller than the actual amount of memory available, but is never larger.) When a process sends 
a message to another process, it decreases the send_avail for that process by the message size; 
when the message has been “consumed” (completely placed in the application buffer on the 
receiving process), the receiver tells the sender and the sender increases its send_avail for the 
receiving process accordingly. The initial value of send_avail is memory_each; the send_avail 
value is maintained dynamically by each process and cannot be set on the command line. 

• When a process has a large message to send, it uses its send_avail value for the receiving 
process to determine how much of the message to send at first. Two parameters called 
send_threshold and send_count control this behavior: 

1. If the send_avail value is equal to or greater than the size of the message, the sender sends 
the whole message at once. 

2. Otherwise, if the send_avail value is equal to or greater than the send_threshold, the sender 
sends the first send_count bytes of the message and waits for an acknowledgment from the 
receiver that they have been consumed before proceeding. 

3. Otherwise, if the send_avail value is equal to or greater than the packet_size, the sender 
sends the first packet of the message and waits for an acknowledgment from the receiver 
that it has been consumed before proceeding. 

4. If the send_avail value is less than packet_size, the send blocks until the receiver tells it that 
some messages have been consumed and send_avail can be increased. (See the discussion 
of give_threshold later in this section for more information on how this occurs.) This 
blocking may or may not block the sending process, depending on whether it used a 
synchronous send (such as csend()) or an asynchronous send (such as isend()). 

Note that deadlock can occur when send_avail is less than packet_size. For example, 
suppose that node A and node B’s system message buffers are both full. Under normal 
circumstances, eventually a receive would be posted and the buffered messages would be 
consumed. But if the two nodes try to exchange messages with synchronous calls, they 
deadlock: A blocks waiting for more space to become available on B, and B blocks waiting 
for more space to become available on A. 

The default value for send_threshold and send_count is half of memory_each; you can change 
these two parameters with the -sth and -sct switches respectively. 

• When a message is consumed, the receiver normally informs the sender that the space occupied 
by the message is available for new messages by “piggy-backing” information on other 
messages going to the sender. However, if there are no such messages, the sender can get out of 
date and stop sending messages because it thinks there is no free memory left for it on the 
receiver. In this case, a parameter called the give_threshold comes into play. If a receiver knows 
that a sender thinks it has less than give_threshold bytes of memory free, but there is really more 
memory available, it sends a special message to the sender telling it how much memory is really 
available. The default value for give_threshold is packet_size; you can change it with the -gth 
switch. 
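
The sender's four cases and the receiver's give_threshold check can be summarized in code. The sketch below mirrors the description above literally and is not the actual implementation, whose details are proprietary; all type and function names are hypothetical.

```c
#include <stddef.h>

typedef enum {
    SEND_WHOLE,        /* case 1: whole message at once        */
    SEND_COUNT,        /* case 2: first send_count bytes       */
    SEND_ONE_PACKET,   /* case 3: first packet only            */
    SEND_BLOCKED       /* case 4: wait for send_avail to grow  */
} send_action;

typedef struct {
    size_t send_avail;
    size_t send_threshold;
    size_t send_count;
    size_t packet_size;
    size_t give_threshold;
} flow_params;

/* Sender side: decide how much of a msg_len-byte message to send first.
   *first_bytes receives the size of the initial piece. */
static send_action first_send(const flow_params *fp, size_t msg_len,
                              size_t *first_bytes)
{
    if (fp->send_avail >= msg_len) {
        *first_bytes = msg_len;
        return SEND_WHOLE;
    }
    if (fp->send_avail >= fp->send_threshold) {
        *first_bytes = fp->send_count;
        return SEND_COUNT;
    }
    if (fp->send_avail >= fp->packet_size) {
        *first_bytes = fp->packet_size;
        return SEND_ONE_PACKET;
    }
    *first_bytes = 0;
    return SEND_BLOCKED;
}

/* Receiver side: returns 1 if an explicit update should be sent because
   the sender believes fewer than give_threshold bytes are free while
   more memory than that is actually available. */
static int should_send_update(const flow_params *fp, size_t sender_view,
                              size_t actual_free)
{
    return sender_view < fp->give_threshold && actual_free > sender_view;
}
```

For example, with send_avail at 18560 bytes and send_threshold and send_count both at 8960 bytes, a 20000-byte message falls under case 2 and its first piece is 8960 bytes.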


Message-Passing Configuration Switches 

The switches that control the message-passing configuration parameters discussed earlier in this 
section are referred to as the mp_switches. Although the default values of these parameters have been 
chosen to give good results for “typical” applications, you may be able to improve your application’s 
message-passing performance by using different values. 

You use the mp_switches on the command line of a parallel application. These switches override the 
default values of the specified parameters for that run of the application; they do not have any effect 
on other runs or other applications. 

If the application was linked with the -nx switch, the mp_switches are automatically interpreted and 
removed from the command line before the application starts up. An application linked with -lnx 
controls its own execution with system calls, as discussed under “Managing Applications” on page 
4-2. Such an application may or may not obey the mp_switches, depending on how it was 
programmed. 

The values used with the mp_switches (except -plk and -noc) are integer numbers of bytes. The 
default, maximum, and minimum values for these switches are described under “Default, Maximum, 
and Minimum Values” on page 8-20. The value you specify may be rounded up or down to ensure 
correct operation, as described under “Dependencies and Rounding” on page 8-21. 


Summary of the Message-Passing Configuration Switches 

The list of available mp_switches is as follows: 


-plk                  Locks parts of the application into memory (see “Process 
                      Locking” on page 8-15 for more information). 

-pkt packet_size      Sets the size of each packet. 

-mbf message_buffer   Sets the total amount of memory allocated to message 
                      buffers in each process. 

-noc correspondents   Sets the total number of other processes from which each 
                      process expects to receive messages. 

-mex memory_export    Used in setting the maximum value for memory_each; 
                      otherwise ignored. 

-mea memory_each      Sets the amount of memory allocated to buffering 
                      messages from each correspondent. 

-sth send_threshold   Sets the threshold for sending multiple packets. 

-sct send_count       Sets the number of bytes to send right away when the 
                      available memory is above send_threshold. 

-gth give_threshold   Sets the threshold for the “give me more messages” message. 


See “Process Locking” on page 8-15, “Packetization” on page 8-16, and “System Message Buffers” 
on page 8-16 for more detailed information. 


Default, Maximum, and Minimum Values 

The default, maximum, and minimum values for the mp_switches are shown in Table 8-1. 


Table 8-1. Message-Passing Configuration Switches 

Switch  Parameter       Default                     Maximum                     Minimum
------  --------------  --------------------------  --------------------------  --------------------------
-plk    none            unlocked                    n/a                         n/a

-pkt    packet_size     1792 or                     1792                        sizeof(xmsg_t)
                        ((memory_each / 2) -
                        sizeof(xmsg_t) [1]),
                        whichever is less

-mbf    message_buffer  1MB + 128KB                 32MB +                      (8 * sizeof(xmsg_t)) *
                                                    (10 * full_packet_size [2]) (correspondents + 2) +
                                                                                (20 * sizeof(xmsg_t))

-mex    memory_export   message_buffer - 128KB      message_buffer - 128KB      2 * (correspondents + 2) *
                                                                                (2 * full_packet_size)

-noc    correspondents  numnodes()                  none                        none

-mea    memory_each     (10 * full_packet_size) or  1MB - 31 or                 2 * full_packet_size
                        maximum memory_each,        (memory_export / 2) /
                        whichever is less           (correspondents + 2),
                                                    whichever is less

-sth    send_threshold  memory_each / 2             memory_each - 1             none

-sct    send_count      memory_each / 2             memory_each                 packet_size

-gth    give_threshold  packet_size                 memory_each / 2             packet_size

1. xmsg_t is a type defined in <mcmsg/mcmsg_xmsg.h> that defines the message header sent along with 
each packet. The size of this type is currently 64 bytes. 

2. full_packet_size = packet_size + sizeof(xmsg_t). 


Dependencies and Rounding 

As you can see from Table 8-1, the values for some of the mp_switches depend on the current values 
of other switches in a circular manner (for example, the default for packet_size depends on the value 
of memory_each, while the default for memory_each depends on the value of packet_size). These 
dependencies are resolved using the following procedure: 

1. Set packet_size: If -pkt is specified, round the specified value up to a multiple of 
sizeof(xmsg_t). Otherwise, use the default value. 

2. Set message_buffer: If -mbf is specified, round the specified value up to a multiple of 
full_packet_size. Otherwise, use the default value. 

3. Set memory_export: If -mex is specified, use the specified value. Otherwise, use the default 
value. 

4. Set memory_each: If -mea is specified, round the specified value down to a multiple of 
sizeof(xmsg_t). Otherwise, round the default value down to a multiple of sizeof(xmsg_t). 

5. Check that memory_each will hold at least two packets: If (memory_each / 2) - sizeof(xmsg_t) 
is less than packet_size, reset packet_size to the value ((memory_each / 2) - sizeof(xmsg_t)), 
round the resulting value down to a multiple of sizeof(xmsg_t), then return to step 2. Otherwise, 
continue to step 6. 

6. Set send_threshold: If -sth is specified, round the specified value down to a multiple of 
packet_size. Otherwise, round the default value down to a multiple of packet_size. 

7. Set send_count: If -sct is specified, round the specified value down to a multiple of packet_size. 
Otherwise, round the default value down to a multiple of packet_size. 

8. Set give_threshold: If -gth is specified, round the specified value down to a multiple of 
packet_size. Otherwise, round the default value down to a multiple of packet_size. 
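
The procedure above can be sketched in C to predict the values that a given set of switches will produce. This is an illustration under stated assumptions, not the system's own code: the type and function names are hypothetical, a value of 0 stands for "switch not specified," the defaults are taken from Table 8-1, and the maximum and minimum limits in the table are not enforced (except that packet_size is never allowed to fall below sizeof(xmsg_t)). The loop is capped at eight passes so that the sketch always terminates.

```c
#include <stddef.h>

#define XMSG_SIZE 64L            /* sizeof(xmsg_t), per Table 8-1, note 1 */
#define KB        1024L
#define MB        (1024L * KB)

typedef struct {
    /* Values given on the command line; 0 means "switch not specified". */
    long pkt, mbf, mex, mea, sth, sct, gth;
    long correspondents;         /* -noc value, or numnodes() by default */
} mp_request;

typedef struct {
    long packet_size, message_buffer, memory_export, memory_each;
    long send_threshold, send_count, give_threshold;
} mp_config;

static long round_down(long v, long m) { return (v / m) * m; }
static long round_up(long v, long m)   { return ((v + m - 1) / m) * m; }
static long min_long(long a, long b)   { return a < b ? a : b; }

static mp_config resolve_mp_switches(const mp_request *rq)
{
    mp_config c;
    long full, max_each, limit;
    int pass;

    /* Step 1: packet_size. */
    c.packet_size = rq->pkt ? round_up(rq->pkt, XMSG_SIZE) : 1792L;

    for (pass = 0; pass < 8; pass++) {     /* bound on "return to step 2" */
        full = c.packet_size + XMSG_SIZE;  /* full_packet_size */

        /* Step 2: message_buffer. */
        c.message_buffer = rq->mbf ? round_up(rq->mbf, full) : MB + 128 * KB;

        /* Step 3: memory_export. */
        c.memory_export = rq->mex ? rq->mex : c.message_buffer - 128 * KB;

        /* Step 4: memory_each (the default is capped by its maximum). */
        max_each = min_long(MB - 31,
                            (c.memory_export / 2) / (rq->correspondents + 2));
        c.memory_each = round_down(
            rq->mea ? rq->mea : min_long(10 * full, max_each), XMSG_SIZE);

        /* Step 5: memory_each must hold at least two packets. */
        limit = c.memory_each / 2 - XMSG_SIZE;
        if (limit >= c.packet_size)
            break;
        c.packet_size = round_down(limit, XMSG_SIZE);
        if (c.packet_size < XMSG_SIZE)
            c.packet_size = XMSG_SIZE;     /* minimum from Table 8-1 */
    }

    /* Steps 6-8: round down to a multiple of packet_size. */
    c.send_threshold = round_down(rq->sth ? rq->sth : c.memory_each / 2,
                                  c.packet_size);
    c.send_count     = round_down(rq->sct ? rq->sct : c.memory_each / 2,
                                  c.packet_size);
    c.give_threshold = round_down(rq->gth ? rq->gth : c.packet_size,
                                  c.packet_size);
    return c;
}
```

With no switches and 8 correspondents, this sketch yields a packet_size of 1792, a memory_each of 18560, and a send_threshold and send_count of 8960 (9280 rounded down to a multiple of 1792). With 512 correspondents, memory_each shrinks to 960, which forces step 5 to reduce packet_size to 384.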


Recommendations 

Because of the way message-passing flow control works, you should try to do all the following to 
achieve the best possible message-passing performance: 

• Avoid paging, by keeping the application’s memory requirements within available physical 
memory. Once you have done this, use the -plk switch if this is allowed at your site. 

• Avoid blocking, by using asynchronous calls. 

• Avoid system message buffering, by posting receives before the message is sent. (See “Avoid 
Message Buffering” on page 8-11 for tips on how to do this.) 

It is important to make all three of these changes if possible. For example, even if you always post 
receives before the corresponding send occurs, system message buffering will still be necessary if 
the application buffer is paged out (or has never been paged in) when the message arrives. 

If you cannot avoid system message buffering, you may be able to improve message-passing 
performance by increasing the message_buffer parameter (-mbf). This parameter determines the 
total amount of memory allocated to message buffers in each process; the other parameters 
determine how this memory is divided up. When you change the value of message_buffer, the 
defaults for the other parameters are automatically scaled to match the current message_buffer size. 
Increasing the message_buffer can increase the efficiency of message passing, but it also increases 
the memory usage of your application, which may cause paging and slow the application down. 
Once you have determined the optimal message_buffer size for your application, you can change the 
other parameters to fine-tune the usage of memory within the message buffer and optimize 
message-passing performance. 

The performance of some applications that use system message buffering can also be improved by 
reducing the correspondents parameter (-noc). This is particularly likely to help if your application 
slows down or hangs when you run it on more nodes. The -noc switch sets the “number of 
correspondents” for each process, which is the number of other processes from which the process 
receives messages. This number is used to determine how the memory allocated to buffering 
messages is divided up; more correspondents means that less memory is available for buffering 
messages from each correspondent. If you don’t use -noc, the default for correspondents is 
numnodes(); that is, it is assumed that each process may receive messages from one process on each 
node. If you know that each process does not receive messages from every other node, using -noc to 
decrease the value of correspondents increases the memory_each buffer size, which can result in 
more efficient message passing (especially if the number of nodes is large). However, if the total 
number of other processes from which any process receives messages during the life of the 
application exceeds the value of correspondents, the application may run more slowly. 

Note that certain global operations, such as sending to node -1 (which broadcasts a message to all 
nodes in the application) or calling gdsum(), can send messages to intermediate nodes. For example, 
sending to node -1 does not simply send one message to every other node; instead, it sends a message 
to several other nodes, which each send messages to several other nodes, and so on in a “message 
tree.” This method is more efficient, but it means that if you use any global operations, the actual 
number of correspondents will be greater than the number of nodes from which each node receives 
explicit messages (by approximately the log of the number of nodes in the application). 
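
A rough starting value for -noc can be estimated from this rule. The sketch below assumes a base-2 logarithm, which the text does not specify, so treat the result as a first guess to refine by experiment; both function names are hypothetical.

```c
/* Smallest k such that 2^k >= n (0 for n <= 1). */
static unsigned ceil_log2(unsigned n)
{
    unsigned k = 0;
    unsigned long p = 1;

    while (p < n) {
        p <<= 1;
        k++;
    }
    return k;
}

/* Rough -noc estimate: the explicit senders, plus about log2(nodes)
   extra correspondents if the application uses global operations.
   The base-2 logarithm is an assumption. */
static unsigned estimate_correspondents(unsigned explicit_senders,
                                        unsigned nodes, int uses_global_ops)
{
    return explicit_senders + (uses_global_ops ? ceil_log2(nodes) : 0);
}
```

For example, a process on a 64-node application that receives explicit messages from 4 neighbors and also uses global operations would start with an estimate of 10.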


I/O Performance 

If your application performs I/O to files, you can use the following techniques to improve its I/O 
performance. Note: the term request size refers to the number of bytes specified in a single read or 
write operation. Techniques discussed in this section include: 

• Use PFS file systems. 

• Use gopen() instead of open(). 

• Use parallel I/O calls. 

• Use asynchronous calls. 

• Use the appropriate I/O mode. 

• Align I/O buffers with virtual memory pages. 

• Read or write whole file system blocks. 

• Make good use of file striping. 

See Chapter 5 for more information on these techniques. 


Use PFS File Systems 

Always store large data files in file systems of type PFS (Parallel File System). These file systems 
are optimized for large I/O requests (request sizes of 64K bytes or more) and simultaneous access 
by multiple nodes, and files in them can be larger than 2G bytes in size. 


Use gopen() Instead of open() 

If all nodes in an application open the same file, you should always use gopen() rather than open(). 
If all nodes call open(), each node sends an “open file” message to the same I/O node at the same 
time, which can swamp the I/O node with messages. But when all nodes call gopen(), only one node 
communicates with the I/O node; the open file descriptor is then broadcast to the other nodes in the 
application through efficient global communication techniques. If you must use open(), try to keep 
all the nodes from calling it at the same time (do not precede the open() with a gsync()). 


Use Parallel I/O Calls 

If you program in Fortran, you should always use Paragon OSF/1 parallel I/O calls, such as cread(), 
to access your files. These calls give much better performance than the standard Fortran file I/O 
statements, such as READ. 

If you program in C, you will not see any I/O performance increase from using parallel I/O calls, 
such as cread(), rather than standard UNIX I/O calls, such as read() (although cread() gives better 
performance than fread(), which is the C equivalent of Fortran’s READ). However, you may be able 
to improve computational performance by using asynchronous I/O calls. 


Use Asynchronous Calls 

The parallel I/O calls are available in both synchronous versions (call names beginning with c) and 
asynchronous versions (call names beginning with i). Synchronous calls do not return until the I/O 
operation is complete; asynchronous calls return immediately, giving you an I/O ID that you can use 
to check when the operation is complete. 

Although the synchronous calls are easier to use and have slightly lower overhead, you should use 
the asynchronous calls whenever the results of the call are not needed immediately. Using 
asynchronous calls can let your application do useful computation in the time when it would 
otherwise just be waiting for a large I/O operation to complete. 


Use the Appropriate I/O Mode 

When you use Paragon OSF/1 parallel I/O calls, you can choose from five I/O modes (M_UNIX, 
M_LOG, M_SYNC, M_RECORD, and M_GLOBAL), each of which is optimized for a particular 
pattern of file I/O. Be sure to use the correct I/O mode for your application’s usage. In particular: 

• Don’t use M_UNIX, the default I/O mode, unless your application depends on its semantics. 

• If all nodes read the same data from the same file at the same time, use M_GLOBAL. 

• If all nodes read or write the same file, but each node is accessing a different part of the file, use 
M_RECORD if at all possible. This mode provides much higher multi-node performance than 
the other modes, all of which force reads and writes from different nodes to the same file to be 
performed in strict sequential order (this is required to preserve standard UNIX I/O semantics, 
but slows the application down). 

• If M_RECORD cannot be used because the I/O request size is not constant across all compute 
nodes, use M_SYNC instead. 


Align I/O Buffers with Virtual Memory Pages 

Try to ensure that memory buffers used in I/O calls are aligned on an 8K-byte boundary whenever 
possible, to align with the operating system’s virtual memory page size. This alignment is 
particularly important in scatter/gather operations with large request sizes to multiple I/O nodes. If 
you do not specify properly-aligned buffers, the software must copy the data to new, aligned buffers, 
which decreases performance. 

To be sure the buffer is well-aligned, you should use malloc() to allocate it rather than allocating it 
statically. (This call is available in both Fortran and C.) Buffers allocated with the standard malloc() 
or its derivatives will always be aligned on a 32-byte boundary. See “Align Application Buffers” on 
page 8-12 for more information on aligning buffers. 


Read or Write Whole File System Blocks 

Disk space is allocated and managed in units called file system blocks. The size of each block in a 
file system is determined when the file system is created. For best performance, PFS file systems 
should have a file system block size of 64K bytes. 

The file system block size is important because files always begin at a block boundary and data is 
most efficiently transferred to and from the physical disk in integer numbers of blocks. Furthermore, 
if a block is modified (but not entirely overwritten) by a write operation, the block may have to be 
read, modified in memory, and then written back. 

Because of this, you will get the best I/O performance if each read or write request begins on a block 
boundary (a multiple of the block size from the beginning of the file) and the request size is a 
multiple of the file system block size. 
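
A request can be checked against these two conditions before it is issued. The following is a minimal sketch with hypothetical function names; the block size would come from your file system, not from a constant in the code.

```c
#include <stddef.h>

/* Returns 1 if a request of 'size' bytes at file offset 'offset' starts
   on a block boundary and covers a whole number of blocks. */
static int is_whole_block_request(unsigned long offset, unsigned long size,
                                  unsigned long block_size)
{
    return size > 0 && offset % block_size == 0 && size % block_size == 0;
}

/* Round a request size up to the next multiple of the block size. */
static unsigned long round_up_to_block(unsigned long size,
                                       unsigned long block_size)
{
    return ((size + block_size - 1) / block_size) * block_size;
}
```

Rounding a request size up is only useful if the application can pad its data or its buffer to match; otherwise, restructure the I/O so that each request naturally covers whole blocks.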

To determine the block size of a file system, you can use the statfs() or fstatfs() call (see the OSF/1 
Programmer’s Reference for information on these calls), or ask your system administrator. 


Make Good Use of File Striping 

Files in PFS file systems are distributed, or striped, across several directories called stripe 
directories. The number of stripe directories in a PFS file system is called the stripe factor, and the 
amount of data from each file that is stored in each directory is called the stripe unit. The product of 
the stripe factor and the stripe unit is called the full stripe size. A PFS file system’s stripe factor and 
stripe unit are set by the system administrator when the PFS file system is mounted. 

Each stripe directory is typically on a separate disk, and each disk is typically controlled by a 
separate I/O node; you get the best I/O performance when you keep all the I/O nodes busy at once. 
You can use file striping to help you do this, with two different methods: 

1. Use a request size equal to an integer multiple of the full stripe size, and make the starting 
address of each request the beginning of a full stripe. With this method, each I/O request goes 
to all the I/O nodes at once. This method can be used on any number of nodes. 

2. Use a request size equal to the stripe unit size, make the starting address of each request the 
beginning of a stripe unit, and choose the starting address of each node’s requests so that the 
nodes’ requests are evenly distributed among the I/O nodes. With this method, each I/O request 
goes to just one I/O node, but the application’s I/O requests are distributed among the I/O nodes. 
This method should be used only if the number of compute nodes is greater than or equal to the 
number of I/O nodes, preferably an integer multiple of the number of I/O nodes. 

These two methods are illustrated in Figure 8-1. Note that method 1 uses fewer, larger requests and 
method 2 uses more, smaller requests. Method 1 is generally more efficient, but method 2 may give 
better performance for some situations (depending on the number of compute nodes, the number of 
I/O nodes, the amount of memory on each I/O node, and the size and frequency of requests). If 
possible, you should try both methods and use whichever is more efficient. The example program in 
/usr/share/examples/c/stripe demonstrates the two methods, and you can use it to help you
determine which method is best for your application. (Note that you will see more consistent results 
from one run to the next if the data size is large—8M bytes per node or more.) 

You should always use the I/O mode M_RECORD when using these methods. M_RECORD is the 
most efficient I/O mode for this type of I/O, and automatically enforces the distribution of data 
among the I/O nodes. If you use M_RECORD, no file pointer calculation or seeking is required. 
For example: 

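    /* Open the file on all nodes in M_RECORD mode. */
    fd = gopen(file, O_WRONLY|O_CREAT|O_TRUNC, M_RECORD, 0666);

    /* Write the buffer one request at a time; no seeking is needed. */
    while (data < end) {
        cwrite(fd, data, request_size);
        data += request_size;
    }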

Using this code, if request_size is equal to the full stripe size, each compute node automatically
accesses all I/O nodes on each write (method 1). Alternatively, if request_size is equal to the stripe
unit, each compute node automatically accesses exactly one I/O node on each write (method 2).
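
To see why no file pointer calculation is needed, it can help to model where each write lands. The C sketch below assumes that M_RECORD stores fixed-size records in node order, one record per node per round of writes, and that stripe units are assigned to stripe directories in round-robin order. Both layouts are assumptions made for illustration, and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed M_RECORD layout: in each round of writes, node 0's record
 * comes first, then node 1's, and so on. */
size_t record_offset(size_t node, size_t round,
                     size_t numnodes, size_t request_size)
{
    return (round * numnodes + node) * request_size;
}

/* Method 2 (request size == stripe unit): the stripe directory that a
 * given node's write reaches, assuming round-robin stripe placement. */
size_t record_stripe_dir(size_t node, size_t round, size_t numnodes,
                         size_t stripe_unit, size_t stripe_factor)
{
    size_t offset = record_offset(node, round, numnodes, stripe_unit);
    return (offset / stripe_unit) % stripe_factor;
}
```

In this model, when the number of compute nodes is an integer multiple of the stripe factor, each node reaches the same stripe directory on every round, and the nodes' requests are spread evenly over the directories.
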

To determine the stripe factor and stripe unit of a PFS file system, you can use the showfs command 
(described under “Displaying File System Attributes” on page 5-5) or the statpfs() or fstatpfs() call
(available only in C; described under “Getting Information About PFS File Systems” on page 5-39).
The example program in /usr/share/examples/c/stripe shows you how you can do this with
fstatpfs().

Figure 8-1. Two Methods of Improving I/O Performance with M_RECORD

This appendix summarizes the commands and system calls of Paragon™ OSF/1. The complete
syntax of each command and call is provided, along with a brief description of each. The C and 
Fortran versions of the calls are discussed in separate sections. 

This appendix discusses only the commands and calls that are specific to Paragon OSF/1. For 
information on the standard commands and calls of OSF/1, see the OSF/1 Command Reference and 
OSF/1 Programmer’s Reference. 


Command Summary 

This section summarizes the commands discussed in Chapter 2 and Chapter 5. See the Paragon ™ 
Commands Reference Manual for more information on these commands. 


Compiling and Linking Applications 


Table A-1. Commands for Compiling and Linking Applications


Command Synopsis 

Description 

cc -nx [ switches ] sourcefile... 

Compile a Paragon OSF/1 application written 
in C on an Intel supercomputer. 

f77 -nx [ switches ] sourcefile... 

Compile a Paragon OSF/1 application written 
in Fortran on an Intel supercomputer. 

icc -nx [ switches ] sourcefile... 

Compile a Paragon OSF/1 application written 
in C on an Intel supercomputer or 
cross-development workstation. 

if77 -nx [ switches ] sourcefile ... 

Compile a Paragon OSF/1 application written 
in Fortran on an Intel supercomputer or 
cross-development workstation. 


Running Applications


Table A-2. Commands for Running Applications 


Command Synopsis 

Description 

application [ -sz size | -sz hXw | -nd hXw:n ]
[ -pri priority ] [ -pt ptype ]
[ -on nodespec ] [ -pn partition ]
[ -pkt packet_size ]
[ -noc correspondents ]
[ -mbf memory_buffer ]
[ -mex memory_export ]
[ -mea memory_each ]
[ -sth send_threshold ] [ -sct send_count ]
[ -gth give_threshold ] [ -plk ]
[ application_args ] [ \; file [ -pt ptype ]
[ -on nodespec ] [ application_args ] ] ...

Execute a Paragon OSF/1 application. 


Managing Partitions 


Table A-3. Commands for Managing Partitions 


Command Synopsis 

Description 

mkpart [ -sz size | -sz hXw | -nd nodespec ]
[ -ss | [ [ -sps | -rq time ]
[ -epl priority ] ] ] [ -mod mode ]
partition

Create a partition.

rmpart [ -f ] [ -r ] partition

Remove a partition.

showpart [ -f ] [ partition ]

Show the characteristics of a partition.

lspart [ -r ] [ partition ]

List the subpartitions of a partition.

pspart [ -r ] [ partition ]

List the applications in a partition.

chpart [ -epl priority ] [ -g group ]
[ -mod mode ] [ -nm name ]
[ -o owner[.group] ] [ -rq time | -sps ]
partition

Change certain partition characteristics. 


Parallel File System Commands


Table A-4. Parallel File System Commands 


Command Synopsis 

Description 

showfs [ -k ] [ -t type ]
[ filesystem | directory ]

Display file system attributes. 

lsize [ -a ] size file... 

Change the size of a file or files. 


Miscellaneous Commands 

Note: the commands shown in Table A-5 are not documented in this manual. 


Table A-5. Miscellaneous Commands 


Command Synopsis 

Description 

fsplit [filename ] 

Split one file containing several Fortran 
program units into several files containing one 
program unit each. (See the Paragon™ 
Commands Reference Manual for more 
information.) 

pmake [ -bcdeFikmnNpqrsStuUvw ]
[ -C dir ] [ -f file ] [ -I dir ] [ -j [ jobs ] ]
[ -l [ load ] ] [ -o file ] [ -P partition ]
[ -W file ] [ macro definition ... ]
[ target ... ]

Parallel make utility that maintains up-to-date 
versions of target files and performs shell 
programs in parallel. (See the Paragon™ 
Application Tools User’s Guide for more 
information.) 

sat [ -bchxV ] [ -d dir ] [ -l log ] [ -m mins ]
[ -o output ] [ -p partition ] [ -r reps ]
[ test ... ]

Run the Paragon system acceptance test. (See 
the Paragon System Acceptance Test User's 
Guide for more information.) 


C System Call Summary

This section summarizes the C versions of the system calls discussed in Chapter 3, Chapter 4, 
Chapter 5, and Chapter 6. See the Paragon ™ C System Calls Reference Manual for more 
information on these calls. 


Process Characteristics 


Table A-6. C Calls for Process Characteristics 


Synopsis 

Description 

long mynode(void); 

Obtain the calling process’s node number. 

long numnodes(void); 

Obtain the number of nodes allocated to the 
current application. 

long myptype(void); 

Obtain the calling process’s process type. 

void setptype(
long ptype );

Set the calling process’s process type (only
permitted if the process type is currently
INVALID_PTYPE).

long myhost(void); 

Obtain the controlling process’s node number. 


Synchronous Send and Receive


Table A-7. C Calls for Synchronous Send and Receive


Synopsis

Description

void csend(
long type,
char *buf,
long count,
long node,
long ptype );

Send a message, waiting for completion.

void crecv(
long typesel,
char *buf,
long count );

Receive a message, waiting for completion.

long csendrecv(
long type,
char *sbuf,
long scount,
long node,
long ptype,
long typesel,
char *rbuf,
long rcount );

Send a message and post a receive for the
reply. Wait for completion.

void gsendx(
long type,
char *buf,
long count,
long nodes[],
long nodecount );

Send a message to a list of nodes, waiting for
completion.


Asynchronous Send and Receive


Table A-8. C Calls for Asynchronous Send and Receive


Synopsis

Description

long isend(
long type,
char *buf,
long count,
long node,
long ptype );

Send a message without waiting for
completion.

long irecv(
long typesel,
char *buf,
long count );

Receive a message without waiting for
completion.

long isendrecv(
long type,
char *sbuf,
long scount,
long node,
long ptype,
long typesel,
char *rbuf,
long rcount );

Send a message and post a receive for the reply
without waiting for completion.

long msgdone(
long mid );

Determine whether a send or receive operation
has completed.

void msgwait(
long mid );

Wait for completion of a send or receive
operation.

void msgignore(
long mid );

Release a message ID as soon as a send or
receive operation completes.

long msgmerge(
long mid1,
long mid2 );

Merge two message IDs into a single ID that
can be used to wait for completion of both
operations.


Probing for Pending Messages


Table A-9. C Calls for Probing for Pending Messages


Synopsis

Description

void cprobe(
long typesel );

Wait for a message of a selected type to arrive.

long iprobe(
long typesel );

Determine whether a message of a selected
type is pending.


Getting Information About Pending or Received Messages 


Table A-10. C Calls for Getting Information About Pending or Received Messages 


Synopsis 

Description 

long infocount(void); 

Return size in bytes of a pending or received 
message. 

long infonode(void); 

Return node number of the node that sent a 
pending or received message. 

long infoptype(void); 

Return process type of the process that sent a 
pending or received message. 

long infotype(void); 

Return message type of a pending or received
message.


Summary of Commands and System Calls 


Paragon™ User’s Guide 


Treating a Message as an Interrupt 


Table A-11. C Calls for Treating a Message as an Interrupt


Synopsis 

Description 

void hsend( 
long type, 
char *buf, 
long count, 
long node, 
long ptype, 
void (* handler) ()); 

Send a message and set up a handler procedure 
to be called when the send completes. 

void hrecv( 
long typesel, 
char *buf, 
long count, 
void (* handler ) ()); 

Receive a message and set up a handler 
procedure to be called when the receive 
completes. 

void hsendrecv( 
long type, 
char *sbuf, 
long scount, 
long node, 
long ptype, 
long typesel, 
char *rbuf, 
long rcount, 
void (* handler) ()); 

Send a message and post a receive for the 
reply. Set up a handler procedure to be called 
when the reply arrives. 

long masktrap( 
long state ); 

Enable or disable interrupts for message 
handlers. Required to prevent corruption of 
global variables. 

void hsendx( 
long type, 
char *buf, 
long count, 
long node, 
long ptype, 
void (*xhandler) (),
long hparam); 

Send a message and set up an extended handler 
procedure to be called with the value hparam 
when the send completes. Allows handler 
sharing. 


Extended Receive and Probe


Table A-12. C Calls for Extended Receive and Probe


Synopsis

Description

void crecvx(
long typesel,
char *buf,
long count,
long nodesel,
long ptypesel,
long info[] );

Receive a message of a specified type from a
specified sending node and process type,
together with information about the message.
Wait for completion.

long irecvx(
long typesel,
char *buf,
long count,
long nodesel,
long ptypesel,
long info[] );

Receive a message of a specified type from a
specified sending node and process type,
together with information about the message.
Do not wait for completion.

void hrecvx(
long typesel,
char *buf,
long count,
long nodesel,
long ptypesel,
void (*xhandler) (),
long hparam );

Receive a message of a specified type from a
specified sending node and process type. Set
up an extended handler procedure to be called
with information about the message and the
value hparam when the receive completes.

void cprobex(
long typesel,
long nodesel,
long ptypesel,
long info[] );

Wait for a message of a specified type from a
specified sending node and process type.
Return information about the message.

long iprobex(
long typesel,
long nodesel,
long ptypesel,
long info[] );

Determine whether a message of a specified
type from a specified sending node and process
type is pending. If it is, return information
about the message.


Global Operations


Table A-13. C Calls for Global Operations (1 of 3)


Synopsis

Description

void gcol(
char x[],
long xlen,
char y[],
long ylen,
long *ncnt );

Concatenation.

void gcolx(
char x[],
long xlens[],
char y[] );

Concatenation for contributions of known
length.

void gdhigh(
double x[],
long n,
double work[] );

Vector double precision MAX.

void gdlow(
double x[],
long n,
double work[] );

Vector double precision MIN.

void gdprod(
double x[],
long n,
double work[] );

Vector double precision MULTIPLY.

void gdsum(
double x[],
long n,
double work[] );

Vector double precision SUM.

void giand(
long x[],
long n,
long work[] );

Vector integer bitwise AND.

void gihigh(
long x[],
long n,
long work[] );

Vector integer MAX.


Table A-13. C Calls for Global Operations (2 of 3)


Synopsis

Description

void gilow(
long x[],
long n,
long work[] );

Vector integer MIN.

void gior(
long x[],
long n,
long work[] );

Vector integer bitwise OR.

void giprod(
long x[],
long n,
long work[] );

Vector integer MULTIPLY.

void gisum(
long x[],
long n,
long work[] );

Vector integer SUM.

void gland(
long x[],
long n,
long work[] );

Vector logical AND.

void glor(
long x[],
long n,
long work[] );

Vector logical inclusive OR.

void gopf(
char x[],
long xlen,
char work[],
long (*function)() );

Arbitrary commutative function.

void gshigh(
float x[],
long n,
float work[] );

Vector real MAX.

void gslow(
float x[],
long n,
float work[] );

Vector real MIN.


Table A-13. C Calls for Global Operations (3 of 3)


Synopsis

Description

void gsprod(
float x[],
long n,
float work[] );

Vector real MULTIPLY.

void gssum(
float x[],
long n,
float work[] );

Vector real SUM.

void gsync(void);

Global synchronization.


Controlling Application Execution 


Table A-14. C Calls for Controlling Application Execution (1 of 2) 


Synopsis 

Description 

long nx_initve( 
char * partition, 
long size, 
char * account, 
long *argc, 
char *argv []); 

Create a new application. 

long nx_initve_rect( 
char *partition, 
long anchor, 
long rows, 
long cols, 
char * account, 
long *argc, 
char *argv[] );

Create a new application with a rectangular 
shape. 

long nx_pri( 
long pgroup, 
long priority ); 

Set the priority of an application. 

long nx_nfork( 
long node_list[],
long numnodes, 
long ptype, 
long pid_list []); 

Copy the current process onto some or all 
nodes of an application. 


Table A-14. C Calls for Controlling Application Execution (2 of 2)


Synopsis 

Description 

long nx_load( 

long node_list[], 
long numnodes, 
long ptype, 
long pid_list[],
char *pathname );

Execute a stored program on some or all nodes 
of an application. 

long nx_loadve( 

long node_list[],
long numnodes,
long ptype, 
long pid_list[], 
char *pathname, 
char *argv[], 
char *envp[ ]); 

Execute a stored program on some or all nodes 
of an application, with specified argument list 
and environment. 

long nx_waitall(void);

Wait for all application processes. 


Getting Information About Applications 


Table A-15. C Calls for Getting Information About Applications 


Synopsis 

Description 

long nx_app_rect( 
long *rows, 
long *cols ); 

Obtain the height and width of the rectangle of 
nodes allocated to the current application. 

int nx_app_nodes( 
pid_t pgroup, 
nx_nodes_t *node_list, 
unsigned long *list_size ); 

List the nodes allocated to an application. 

int nx_pspart(
char *partition, 
nx_pspart_t **pspart_list, 
unsigned long *list_size ); 

Obtain information about all applications and 
active subpartitions in a partition. 


Partition Management


Table A-16. C Calls for Partition Management (1 of 2) 


Synopsis 

Description 

long nx_mkpart( 

char * partition , 
long size, 
long type ); 

Create a partition with a particular number of 
nodes. 

long nx_mkpart_rect( 

char *partition, 
long rows, 
long cols, 
long type ); 

Create a partition with a particular height and 
width. 

long nx_mkpart_map(
char *partition,
long numnodes,
long node_list[],
long type );

Create a partition with a specific set of nodes. 

long nx_rmpart( 

char *partition, 
long force, 
long recursive ); 

Remove a partition. 

int nx_part_attr(

char *partition, 
nx_part_info_t * attributes ); 

Get a partition’s attributes. 

int nx_part_nodes(
char *partition, 
nx_nodes_t *node_list, 
unsigned long *list_size); 

List the root node numbers for the nodes of a 
partition. 

long nx_chpart_name( 
char *partition, 
char *name ); 

Change a partition’s name. 

long nx_chpart_mod( 
char * partition, 
long mode ); 

Change a partition’s protection modes. 

long nx_chpart_epl( 

char *partition, 
long priority ); 

Change a partition’s effective priority limit.

long nx_chpart_rq( 
char * partition, 
long rollin_quantum); 

Change a partition’s rollin quantum. 


Table A-16. C Calls for Partition Management (2 of 2)


Synopsis 

Description 

long nx_chpart_owner( 

char *partition,
long owner, 
long group ); 

Change a partition’s owner and group. 

long nx_chpart_sched( 

char *partition, 
int sched_type );

Change a partition’s scheduling type. 


Finding Unusable Nodes 


Table A-17. C Calls for Finding Unusable Nodes 


Synopsis 

Description 

int nx_empty_nodes( 

nx_nodes_t *node_list, 
unsigned long *list_size ); 

List the nodes that are empty slots. 

int nx_failed_nodes( 

nx_nodes_t *node_list,
unsigned long *list_size ); 

List the nodes that failed to boot. 


Handling Errors 


Table A-18. C Calls for Handling Errors 


Synopsis 

Description 

_call (...); 

Special version of call that returns error value 
to caller. 

void nx_perror( 

char * string ); 

Print an error message corresponding to the 
current value of errno. 


Floating-Point Control


Table A-19. C Calls for Floating-Point Control 


Synopsis 

Description 

int isnan( 

double dsrc ); 

Determine if a double value is Not-a-Number. 

int isnand( 

double dsrc ); 

Determine if a double value is Not-a-Number. 

int isnanf( 

float fsrc ); 

Determine if a float value is Not-a-Number. 

fp_rnd fpgetround(void);

Get the floating-point rounding mode for the 
calling process. 

fp_rnd fpsetround(
fp_rnd rnd_dir );

Set the floating-point rounding mode for the 
calling process. 

fp_except fpgetmask(void); 

Get the floating-point exception mask for the 
calling process. 

fp_except fpsetmask( 
fp_except mask ); 

Set the floating-point exception mask for the 
calling process. 

fp_except fpgetsticky(void); 

Get the floating-point exception sticky flags 
for the calling process. 

fp_except fpsetsticky( 
fp_except sticky ); 

Set the floating-point exception sticky flags for 
the calling process. 


Miscellaneous Calls 


Table A-20. Miscellaneous C Calls 


Synopsis 

Description 

void flick(void); 

Temporarily relinquish the CPU to another 
process. 

void led( 

long state ); 

Turn the node’s green LED on or off. 

double dclock(void); 

Return time in seconds since booting the 
system. 


iPSC® and Touchstone DELTA Compatibility


Table A-21. C Calls for iPSC® and Touchstone DELTA Compatibility (1 of 2)


Synopsis

Description

void flushmsg(
long typesel,
long nodesel,
long ptypesel );

Flush specified messages from the system.

long ginv(
long j );

Return the position of an element in the
binary-reflected gray code sequence. Inverse
of gray().

long gray(
long j );

Return the binary-reflected gray code for an
integer.

void hwclock(
esize_t *hwtime );

Place the current value of the hardware counter
into a 64-bit unsigned integer variable.

long infopid(void);

Return the process type of the process that sent
a pending or received message.

void killcube(
long node,
long ptype );

Terminate and clear node process(es).

void killproc(
long node,
long ptype );

Terminate a node process.

void load(
char *filename,
long node,
long ptype );

Load a node process.

unsigned long mclock(void);

Return the time in milliseconds.

void msgcancel(
long mid );

Cancel an asynchronous send or receive
operation.

long mypart(
long *rows,
long *cols );

Obtain the height and width of the rectangle of
nodes allocated to the current application.

long mypid(void);

Return the process type of the calling process.


Table A-21. C Calls for iPSC® and Touchstone DELTA Compatibility (2 of 2)


Synopsis 

Description 

long nodedim(void); 

Return the dimension of the current application
(the number of nodes allocated to the
application is 2^dimension).

long restrictvol( 

long fildes, 
long nvol, 
long vollist []); 

Return 0 (does nothing; provided for 
compatibility only). 
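
The gray code entries in Table A-21 can be illustrated with a short C sketch. The functions below are hypothetical stand-ins written for this example; they show the standard binary-reflected gray code and its inverse, along with the node count implied by a dimension, and are not the library implementations of gray(), ginv(), or nodedim().

```c
/* Hypothetical stand-in for gray(): the binary-reflected gray code
 * of j is j XOR (j shifted right by one bit). */
unsigned long gray_sketch(unsigned long j)
{
    return j ^ (j >> 1);
}

/* Hypothetical stand-in for ginv(): recover the position j from a
 * gray code g by XORing together g, g >> 1, g >> 2, and so on. */
unsigned long ginv_sketch(unsigned long g)
{
    unsigned long j = 0;
    while (g != 0) {
        j ^= g;
        g >>= 1;
    }
    return j;
}

/* Number of nodes for a given dimension: 2 raised to the dimension. */
unsigned long nodes_for_dimension(unsigned int dimension)
{
    return 1UL << dimension;
}
```

For example, the positions 0, 1, 2, 3 map to the gray codes 0, 1, 3, 2, and an application of dimension 3 has 8 nodes.
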


I/O Modes 


Table A-22. C Calls for I/O Modes 


Synopsis 

Description 

int gopen( 

const char *path, 
int oflag, 
int iomode [, 
mode_t mode ]); 

Open a file on all nodes and set its I/O mode. 

void setiomode( 
int fildes, 
int iomode ); 

Set the I/O mode for a file. 

long iomode( 
int fildes ); 

Return the current I/O mode for a file. 


Reading and Writing Files in Parallel


Table A-23. C Calls for Reading and Writing Files in Parallel 


Synopsis 

Description 

void cread( 
int fildes, 
char * buffer, 
unsigned int nbytes ); 

Read from a file, waiting for completion. 

void cwrite( 
int fildes,
char * buffer, 
unsigned int nbytes ); 

Write to a file, waiting for completion. 

void creadv( 
int fildes, 
struct iovec *iov, 
int iovcount ); 

Read from a file to irregularly-scattered 
buffers, waiting for completion. 

void cwritev( 
int fildes, 
struct iovec *iov, 
int iovcount); 

Write to a file from irregularly-scattered 
buffers, waiting for completion. 

long iread( 
int fildes, 
char *buffer, 
unsigned int nbytes ); 

Read from a file without waiting for 
completion. 

long iwrite( 
int fildes, 
char * buffer, 
unsigned int nbytes ); 

Write to a file without waiting for completion. 

long ireadv( 
int fildes, 
struct iovec *iov, 
int iovcount ); 

Read from a file to irregularly-scattered 
buffers, without waiting for completion.

long iwritev( 
int fildes, 
struct iovec *iov, 
int iovcount ); 

Write to a file from irregularly-scattered 
buffers, without waiting for completion.

long iodone( 
long id ); 

Determine whether an asynchronous I/O 
operation is complete. If complete, release the 
I/O ID. 

void iowait( 
long id ); 

Wait for completion of an asynchronous I/O 
operation and release the I/O ID.


Detecting End-of-File and Moving the File Pointer


Table A-24. C Calls for Detecting End-of-File and Moving the File Pointer


Synopsis

Description

long iseof(
int fildes );

Test for end-of-file.

off_t lseek(
int fildes,
off_t offset,
int whence );

Move the read/write file pointer.



Increasing the Size of a File 


Table A-25. C Calls for Increasing the Size of a File 


Synopsis 

Description 

long lsize(
int fildes,
off_t offset,
int whence );

Increase size of a file.


Extended File Manipulation


Table A-26. C Calls for Extended File Manipulation 


Synopsis 

Description 

esize_t eseek( 
int fildes, 
esize_t offset, 
int whence ); 

Move file pointer in extended file. 

esize_t esize( 
int fildes, 
esize_t offset, 
int whence ); 

Increase size of extended file. 

long estat(
char *path, 
struct estat * buffer ); 

Get status of extended file from pathname. 

long lestat( 
char *path, 
struct estat *buffer ); 

Get status of extended file or symbolic link 
from pathname. 

long festat( 
int fildes, 

struct estat *buffer ); 

Get status of open extended file from file
descriptor or unit.


Performing Extended Arithmetic 


Table A-27. C Calls for Performing Extended Arithmetic 


Synopsis 

Description 

esize_t eadd(
esize_t e1,
esize_t e2 );

Add two extended integers.

long ecmp(
esize_t e1,
esize_t e2 );

Compare two extended integers.

long ediv(
esize_t e,
long n );

Divide extended integer by integer.

long emod(
esize_t e,
long n );

Give extended integer modulo an integer
(remainder when e is divided by n).

esize_t emul(
esize_t e,
long n );

Multiply extended integer by integer.

esize_t esub(
esize_t e1,
esize_t e2 );

Subtract two extended integers.

void etos(
esize_t e,
char *s );

Convert extended integer to string.

esize_t stoe(
char *s );

Convert string to extended integer.


Getting Information About PFS File Systems


Table A-28. C Calls for Getting Information About PFS™ File Systems 


Synopsis 

Description 

long getpfsinfo( 

struct pfsmntinfo **attrbufp ); 

Get PFS-specific information about all 
mounted PFS file systems. 

int statpfs(
char *path,
struct estatfs *fs_buffer,
struct statpfs *pfs_buffer,
unsigned int pfs_bufsize );

Get PFS-specific and non-PFS-specific
information for the file system containing path.

long fstatpfs(
int fildes,
struct estatfs *fs_buffer,
struct statpfs *pfs_buffer,
unsigned int pfs_bufsize );

Get PFS-specific and non-PFS-specific 
information for the file system containing the 
file open on fildes. 


Managing Pthread Execution


Table A-29. C Calls for Managing Pthread Execution 


Synopsis 

Description 

int pthread_create( 

pthread_t *thread, 
pthread_attr_t attr, 
void *(*routine)(void *arg),
void *arg ); 

Creates a pthread. 

pthread_t pthread_self(void); 

Returns the ID of the calling pthread. 

int pthread_equal( 

pthread_t threadl, 
pthread_t thread2 ); 

Compares two pthread identifiers. 

void pthread_yield(void); 

Allows the scheduler to run another pthread 
instead of the current one. 

void pthread_exit( 

void * status ); 

Terminates the calling pthread. 

int pthread_join(

pthread_t thread, 
void **status ); 

Waits for a pthread to terminate. 

int pthread_detach( 

pthread_t *thread ); 

Detaches a pthread. 


Managing Pthread Attributes 


Table A-30. C Calls for Managing Pthread Attributes 


Synopsis 

Description 

int pthread_attr_create( 

pthread_attr_t *attr ); 

Creates a pthread attributes object. 

int pthread_attr_setstacksize( 
pthread_attr_t *attr, 
long stacksize ); 

Sets the value of the stack size attribute of a 
pthread attributes object. 

int pthread_attr_delete( 

pthread_attr_t *attr ); 

Deletes a pthread attributes object. 

int pthread_attr_getstacksize(

pthread_attr_t attr ); 

Returns the value of the stack size attribute of 
a pthread attributes object. 


Managing Mutexes


Table A-31. C Calls for Managing Mutexes 


Synopsis 

Description 

int pthread_mutex_init( 
pthread_mutex_t *mutex, 
pthread_mutexattr_t attr ); 

Creates a mutex. 

int pthread_mutex_lock( 

pthread_mutex_t *mutex ); 

Locks a mutex. 

int pthread_mutex_trylock(
pthread_mutex_t *mutex ); 

Tries once to lock a mutex. 

int pthread_mutex_unlock( 
pthread_mutex_t *mutex ); 

Unlocks a mutex. 

int pthread_mutex_destroy( 
pthread_mutex_t *mutex ); 

Deletes a mutex. 

int pthread_mutexattr_create( 
pthread_mutexattr_t *attr ); 

Creates a mutex attributes object. 

int pthread_mutexattr_delete(
pthread_mutexattr_t *attr ); 

Deletes a mutex attributes object. 


Using Condition Variables to Synchronize Pthreads


Table A-32. C Calls for Using Condition Variables to Synchronize Pthreads 


Synopsis 

Description 

int pthread_cond_init( 

pthread_cond_t *cond, 
pthread_condattr_t attr ); 

Creates a condition variable. 

int pthread_cond_wait( 

pthread_cond_t *cond, 
pthread_mutex_t *mutex ); 

Waits on a condition variable. 

int pthread_cond_timedwait( 

pthread_cond_t *cond, 
pthread_mutex_t *mutex, 
struct timespec *abstime ); 

Waits on a condition variable for a specified 
period of time. 

int pthread_cond_signal( 

pthread_cond_t *cond ); 

Wakes up a pthread that is waiting on a 
condition variable. 

int pthread_cond_broadcast( 

pthread_cond_t *cond ); 

Wakes up all pthreads that are waiting on a 
condition variable. 

int pthread_cond_destroy( 

pthread_cond_t *cond ); 

Destroys a condition variable. 

int pthread_condattr_create( 

pthread_condattr_t *attr ); 

Creates a condition variable attributes object. 

int pthread_condattr_delete( 

pthread_condattr_t *attr );

Deletes a condition variable attributes object. 


Canceling Pthreads 


Table A-33. C Calls for Canceling Pthreads 


Synopsis 

Description 

int pthread_cancel( 

pthread_t thread );

Requests cancellation of a pthread. 

int pthread_setcancel( 
int state ); 

Enables or disables the general cancelability of 
the calling pthread. 

int pthread_setasynccancel( 

int state ); 

Enables or disables the asynchronous 
cancelability of the calling pthread. 

void pthread_testcancel(void); 

Creates a cancellation point in the calling 
pthread. 


Pthreads Cleanup Routines


Table A-34. C Calls for Pthreads Cleanup Routines 


Synopsis 

Description 

void pthread_cleanup_pop( 

int execute );

Removes a routine from the top of the cleanup 
stack of the calling pthread and optionally 
executes it. 

void pthread_cleanup_push(

void (*routine)(void *arg ), 
void *arg ); 

Pushes a routine onto the cleanup stack of the 
calling pthread. 


Managing Pthread Keys 


Table A-35. C Calls for Managing Pthread Keys 


Synopsis 

Description 

int pthread_keycreate( 
pthread_key_t *key, 
void (*destructor)(void *value) );

Creates a key to be used with pthread-specific 
data. 

int pthread_setspecific(
pthread_key_t key, 
void * value ); 

Binds a pthread-specific value to a key. 

int pthread_getspecific( 

pthread_key_t key, 
void **value ); 

Returns the value bound to a key. 


Miscellaneous Pthread Calls 


Table A-36. Miscellaneous Pthread Calls 


Synopsis 

Description 

int pthread_once(
pthread_once_t *once_block,
void (*routine)() );

Calls an initialization routine.

int sigwait(
sigset_t *set );

Suspends the calling pthread until one of a
specified set of signals is received.


Fortran System Call Summary

This section summarizes the Fortran versions of the system calls discussed in Chapter 3, Chapter 4, 
and Chapter 5. See the Paragon ™ Fortran System Calls Reference Manual for more information on 
these calls. 


Process Characteristics 


Table A-37. Fortran Calls for Process Characteristics 


Synopsis 

Description 

INTEGER FUNCTION MYNODE() 

Obtain the calling process’s node number. 

INTEGER FUNCTION NUMNODES()

Obtain the number of nodes allocated to the
current application.

SUBROUTINE SETPTYPE(ptype)
INTEGER ptype

Set the calling process’s process type (only
permitted if the process type is currently
INVALID_PTYPE).

INTEGER FUNCTION MYPTYPE()

Obtain the calling process’s process type.

INTEGER FUNCTION MYHOST()

Obtain the controlling process’s node number. 
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Synchronous Send and Receive 


Table A-38. Fortran Calls for Synchronous Send and Receive 


Synopsis 

Description 

SUBROUTINE CSEND(type, buf, count,
node, ptype)

INTEGER type

INTEGER buf(*)

INTEGER count 

INTEGER node 

INTEGER ptype 

Send a message, waiting for completion. 

SUBROUTINE CRECV(typesel, buf, count)

INTEGER typesel

INTEGER buf(*)

INTEGER count 

Receive a message, waiting for completion. 

INTEGER FUNCTION CSENDRECV(type,
sbuf, scount, node, ptype, typesel, rbuf, 
rcount) 

INTEGER type 

INTEGER sbuf(*) 

INTEGER scount 

INTEGER node 

INTEGER ptype 

INTEGER typesel 

INTEGER rbuf(*)

INTEGER rcount 

Send a message and post a receive for the 
reply. Wait for completion. 

SUBROUTINE GSENDX(type, buf, count,
nodes, nodecount)

INTEGER type

INTEGER buf(*)

INTEGER count

INTEGER nodes(*)

INTEGER nodecount 

Send a message to a list of nodes, waiting for 
completion. 
Asynchronous Send and Receive


Table A-39. Fortran Calls for Asynchronous Send and Receive (1 of 2) 


Synopsis 

Description 

INTEGER FUNCTION ISEND(type, buf,
count, node, ptype)

INTEGER type

INTEGER buf(*)

INTEGER count 

INTEGER node 

INTEGER ptype 

Send a message without waiting for 
completion. 

INTEGER FUNCTION IRECV(typesel, buf,
count)

INTEGER typesel

INTEGER buf(*)

INTEGER count 

Receive a message without waiting for 
completion. 

INTEGER FUNCTION ISENDRECV(type,
sbuf, scount, node, ptype, typesel, rbuf, 
rcount) 

INTEGER type 

INTEGER sbuf(*)

INTEGER scount 

INTEGER node 

INTEGER ptype 

INTEGER typesel 

INTEGER rbuf(*)

INTEGER rcount 

Send a message and post a receive for the reply 
without waiting for completion. 

INTEGER FUNCTION MSGDONE(mid)

INTEGER mid 

Determine whether a send or receive operation 
has completed. 

SUBROUTINE MSGWAIT(mid)

INTEGER mid 

Wait for completion of a send or receive 
operation. 
Table A-39. Fortran Calls for Asynchronous Send and Receive (2 of 2)


Synopsis 

Description 

SUBROUTINE MSGIGNORE(mid)

INTEGER mid 

Release a message ID as soon as a send or 
receive operation completes. 

INTEGER FUNCTION MSGMERGE(mid1,
mid2)

INTEGER mid1

INTEGER mid2 

Merge two message IDs into a single ID that 
can be used to wait for completion of both 
operations. 


Probing for Pending Messages 


Table A-40. Fortran Calls for Probing for Pending Messages 


Synopsis 

Description 

SUBROUTINE CPROBE(typesel)

INTEGER typesel

Wait for a message of a selected type to arrive.

INTEGER FUNCTION IPROBE(typesel)

INTEGER typesel

Determine whether a message of a selected 
type is pending. 


Getting Information About Pending or Received Messages 


Table A-41. Fortran Calls for Getting Information About Pending or Received Messages 


Synopsis 

Description 

INTEGER FUNCTION INFOCOUNT() 

Return size in bytes of a pending or received 
message. 

INTEGER FUNCTION INFONODE()

Return node number of the node that sent a
pending or received message.

INTEGER FUNCTION INFOPTYPE()

Return process type of the process that sent a
pending or received message.

INTEGER FUNCTION INFOTYPE()

Return message type of a pending or received 
message. 
Treating a Message as an Interrupt


Table A-42. Fortran Calls for Treating a Message as an Interrupt (1 of 2) 


Synopsis 

Description 

SUBROUTINE HSEND(type, buf, count,
node, ptype, handler)

INTEGER type

INTEGER buf(*)

INTEGER count

INTEGER node

INTEGER ptype

EXTERNAL handler

Send a message and set up a handler procedure
to be called when the send completes.


SUBROUTINE HRECV(typesel, buf, count,
handler)

INTEGER typesel

INTEGER buf(*)

INTEGER count 

EXTERNAL handler 

Receive a message and set up a handler 
procedure to be called when the receive 
completes. 

SUBROUTINE HSENDRECV(type, sbuf,
scount, node, ptype, typesel, rbuf, rcount,
handler)

INTEGER type

INTEGER sbuf(*)

INTEGER scount

INTEGER node

INTEGER ptype

INTEGER typesel

INTEGER rbuf(*)

INTEGER rcount

EXTERNAL handler

Send a message and post a receive for the
reply. Set up a handler procedure to be called
when the reply arrives.
Table A-42. Fortran Calls for Treating a Message as an Interrupt (2 of 2)


Synopsis 

Description 

INTEGER FUNCTION MASKTRAP(state)

INTEGER state 

Enable or disable interrupts for message 
handlers. Required to prevent corruption of 
global variables. 

SUBROUTINE HSENDX(type, buf, count,
node, ptype, xhandler, hparam)

INTEGER type

INTEGER buf(*)

INTEGER count 

INTEGER node 

INTEGER ptype 

EXTERNAL xhandler 

INTEGER hparam 

Send a message and set up an extended handler 
procedure to be called with the value hparam 
when the send completes. Allows handler 
sharing. 


Extended Receive and Probe 


Table A-43. Fortran Calls for Extended Receive and Probe (1 of 2) 


Synopsis 

Description 

SUBROUTINE CRECVX(typesel, buf, count,
nodesel, ptypesel, info)

INTEGER typesel

INTEGER buf(*)

INTEGER count

INTEGER nodesel

INTEGER ptypesel

INTEGER info(8)

Receive a message of a specified type from a 
specified sending node and process type, 
together with information about the message. 
Wait for completion. 

INTEGER FUNCTION IRECVX(typesel, 
buf, count, nodesel, ptypesel, info) 

INTEGER typesel 

INTEGER buf(*)

INTEGER count 

INTEGER nodesel 

INTEGER ptypesel 

INTEGER info(8)

Receive a message of a specified type from a
specified sending node and process type,
together with information about the message.
Do not wait for completion.
Table A-43. Fortran Calls for Extended Receive and Probe (2 of 2)


Synopsis 

Description 

SUBROUTINE HRECVX(typesel, buf, count,
nodesel, ptypesel, xhandler, hparam)

INTEGER typesel

INTEGER buf(*)

INTEGER count 

INTEGER nodesel 

INTEGER ptypesel 

EXTERNAL xhandler 

INTEGER hparam 

Receive a message of a specified type from a 
specified sending node and process type. Set 
up an extended handler procedure to be called 
with information about the message and the 
value hparam when the receive completes. 

SUBROUTINE CPROBEX(typesel, nodesel,
ptypesel, info)

INTEGER typesel

INTEGER nodesel

INTEGER ptypesel

INTEGER info(8)

Wait for a message of a specified type from a
specified sending node and process type.
Return information about the message.

INTEGER FUNCTION IPROBEX(typesel,
nodesel, ptypesel, info)

INTEGER typesel

INTEGER nodesel

INTEGER ptypesel

INTEGER info(8)

Determine whether a message of a specified 
type from a specified sending node and process 
type is pending. If it is, return information 
about the message. 
Global Operations


Table A-44. Fortran Calls for Global Operations (1 of 3) 


Synopsis 

Description 

SUBROUTINE GCOL(x, xlen, y, ylen, ncnt)
INTEGER x(*)
INTEGER xlen
INTEGER y(*)
INTEGER ylen
INTEGER ncnt

Concatenation.

SUBROUTINE GCOLX(x, xlens, y)
INTEGER x(*)
INTEGER xlens(*)
INTEGER y(*)

Concatenation for contributions of known
length.

SUBROUTINE GDHIGH(x, n, work)
DOUBLE PRECISION x(*)
INTEGER n
DOUBLE PRECISION work(*)

Vector double precision MAX.

SUBROUTINE GDLOW(x, n, work)
DOUBLE PRECISION x(*)
INTEGER n
DOUBLE PRECISION work(*)

Vector double precision MIN.

SUBROUTINE GDPROD(x, n, work)
DOUBLE PRECISION x(*)
INTEGER n
DOUBLE PRECISION work(*)

Vector double precision MULTIPLY.

SUBROUTINE GDSUM(x, n, work)
DOUBLE PRECISION x(*)
INTEGER n
DOUBLE PRECISION work(*)

Vector double precision SUM.

SUBROUTINE GIAND(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer bitwise AND. 
Table A-44. Fortran Calls for Global Operations (2 of 3)


Synopsis 

Description 

SUBROUTINE GIHIGH(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer MAX.

SUBROUTINE GILOW(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer MIN.

SUBROUTINE GIOR(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer bitwise OR.

SUBROUTINE GIPROD(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer MULTIPLY.

SUBROUTINE GISUM(x, n, work)
INTEGER x(*)
INTEGER n
INTEGER work(*)

Vector integer SUM.

SUBROUTINE GLAND(x, n, work)
LOGICAL x(*)
INTEGER n
LOGICAL work(*)

Vector logical AND.

SUBROUTINE GLOR(x, n, work)
LOGICAL x(*)
INTEGER n
LOGICAL work(*)

Vector logical inclusive OR.
Table A-44. Fortran Calls for Global Operations (3 of 3)


Synopsis 

Description 

SUBROUTINE GOPF(x, xlen, work,
function)
INTEGER x(*)
INTEGER xlen
INTEGER work(*)
EXTERNAL function

Arbitrary commutative function.

SUBROUTINE GSHIGH(x, n, work)
REAL x(*)
INTEGER n
REAL work(*)

Vector real MAX.

SUBROUTINE GSLOW(x, n, work)
REAL x(*)
INTEGER n
REAL work(*)

Vector real MIN.

SUBROUTINE GSPROD(x, n, work)
REAL x(*)
INTEGER n
REAL work(*)

Vector real MULTIPLY.

SUBROUTINE GSSUM(x, n, work)
REAL x(*)
INTEGER n
REAL work(*)

Vector real SUM.

SUBROUTINE GSYNC()

Global synchronization.
Controlling Application Execution


Table A-45. Fortran Calls for Controlling Application Execution (1 of 2) 


Synopsis 

Description 

INTEGER FUNCTION NX_INITVE(partition, size, account,
argc, argv)
CHARACTER partition*(*)
INTEGER size
CHARACTER account*(*)
INTEGER argc
INTEGER argv

Create a new application.

INTEGER FUNCTION NX_INITVE_RECT(partition, anchor,
rows, cols, account, argc, argv)
CHARACTER partition*(*)
INTEGER anchor
INTEGER rows
INTEGER cols
CHARACTER account*(*)
INTEGER argc
INTEGER argv

Create a new application with a rectangular
shape.

INTEGER FUNCTION NX_PRI(pgroup, priority)
INTEGER pgroup
INTEGER priority

Set the priority of an application.

INTEGER FUNCTION NX_NFORK(node_list, numnodes,
ptype, pid_list)
INTEGER node_list(*)
INTEGER numnodes
INTEGER ptype
INTEGER pid_list(*)

Copy the current process onto some or all
nodes of an application.
Table A-45. Fortran Calls for Controlling Application Execution (2 of 2)


Synopsis 

Description 

INTEGER FUNCTION NX_LOAD(node_list, numnodes, ptype,
pid_list, pathname)
INTEGER node_list(*)
INTEGER numnodes
INTEGER ptype
INTEGER pid_list(*)
CHARACTER pathname*(*)

Execute a stored program on some or all nodes
of an application.

INTEGER FUNCTION NX_LOADVE(node_list, numnodes,
ptype, pid_list, pathname, argv, envp)
INTEGER node_list(*)
INTEGER numnodes
INTEGER ptype
INTEGER pid_list(*)
CHARACTER pathname*(*)
INTEGER argv
INTEGER envp

Execute a stored program on some or all nodes
of an application, with specified argument list
and environment.

SUBROUTINE NX_WAITALL()

Wait for all application processes.


Getting Information About Applications 


Table A-46. Fortran Calls for Getting Information About Applications 


Synopsis 

Description 

INTEGER FUNCTION NX_APP_RECT(rows, cols)
INTEGER rows
INTEGER cols

Obtain the height and width of the rectangle of
nodes allocated to the current application.

INTEGER FUNCTION NX_APP_NODES(pgroup, ptr, list_size)
INTEGER pgroup
POINTER (ptr, node_list(1))
INTEGER list_size

List the nodes allocated to an application.
Partition Management


Table A-47. Fortran Calls for Partition Management (1 of 2) 


Synopsis 

Description 

INTEGER FUNCTION NX_MKPART(partition, size, type)
CHARACTER partition*(*)
INTEGER size
INTEGER type

Create a partition with a particular number of
nodes.

INTEGER FUNCTION NX_MKPART_RECT(partition, rows,
cols, type)
CHARACTER partition*(*)
INTEGER rows
INTEGER cols
INTEGER type

Create a partition with a particular height and
width.

INTEGER FUNCTION NX_MKPART_MAP(partition,
numnodes, node_list, type)
CHARACTER partition*(*)
INTEGER numnodes
INTEGER node_list(*)
INTEGER type

Create a partition with a specific set of nodes.

INTEGER FUNCTION NX_RMPART(partition, force,
recursive)
CHARACTER partition*(*)
INTEGER force
INTEGER recursive

Remove a partition.

INTEGER FUNCTION NX_PART_ATTR(partition, attributes)
CHARACTER partition*(*)
RECORD /nx_part_info_t/ attributes

Get a partition’s attributes.
Table A-47. Fortran Calls for Partition Management (2 of 2)


Synopsis 

Description 

INTEGER FUNCTION NX_PART_NODES(partition, ptr,
list_size)
CHARACTER partition*(*)
POINTER (ptr, node_list(1))
INTEGER list_size

List the root node numbers for the nodes of a
partition.

INTEGER FUNCTION NX_CHPART_NAME(partition, name)
CHARACTER partition*(*)
CHARACTER name*(*)

Change a partition’s name.

INTEGER FUNCTION NX_CHPART_MOD(partition, mode)
CHARACTER partition*(*)
INTEGER mode

Change a partition’s protection modes.

INTEGER FUNCTION NX_CHPART_EPL(partition, priority)
CHARACTER partition*(*)
INTEGER priority

Change a partition’s effective priority limit.

INTEGER FUNCTION NX_CHPART_RQ(partition,
rollin_quantum)
CHARACTER partition*(*)
INTEGER rollin_quantum

Change a partition’s rollin quantum.

INTEGER FUNCTION NX_CHPART_OWNER(partition,
owner, group)
CHARACTER partition*(*)
INTEGER owner
INTEGER group

Change a partition’s owner and group.

INTEGER FUNCTION NX_CHPART_SCHED(partition,
sched_type)
CHARACTER partition*(*)
INTEGER sched_type

Change a partition’s scheduling type.
Finding Unusable Nodes


Table A-48. Fortran Calls for Finding Unusable Nodes 


Synopsis 

Description 

INTEGER FUNCTION NX_EMPTY_NODES(ptr, list_size)
POINTER (ptr, node_list(1))
INTEGER list_size

List the nodes that are empty slots.

INTEGER FUNCTION NX_FAILED_NODES(ptr, list_size)
POINTER (ptr, node_list(1))
INTEGER list_size

List the nodes that failed to boot.



Handling Errors 


Table A-49. Fortran Calls for Handling Errors 


Synopsis 

Description 

SUBROUTINE NX_PERROR(string)

CHARACTER string*(*)

Print an error message corresponding to the
current value of errno.


Floating-Point Control 


Table A-50. Fortran Calls for Floating-Point Control 


Synopsis 

Description 

INTEGER FUNCTION FPSETMASK(mask)

INTEGER mask 

Set the floating-point exception mask for the 
calling process. 
Miscellaneous Calls


Table A-51. Miscellaneous Fortran Calls 


Synopsis 

Description 

SUBROUTINE FLICK() 

Temporarily relinquish the CPU to another 
process. 

SUBROUTINE LED(state)
INTEGER state

Turn the node’s green LED on or off.

DOUBLE PRECISION FUNCTION DCLOCK()

Return time in seconds since booting the
system.


iPSC® and Touchstone DELTA Compatibility 


Table A-52. Fortran Calls for iPSC® and Touchstone DELTA Compatibility (1 of 2) 


Synopsis 

Description 

SUBROUTINE FLUSHMSG(typesel,
nodesel, ptypesel)

INTEGER typesel

INTEGER nodesel

INTEGER ptypesel

Flush specified messages from the system.

INTEGER FUNCTION GINV(graycode)

INTEGER graycode

Return the position of an element in the
binary-reflected gray code sequence. Inverse
of gray().

INTEGER FUNCTION GRAY(position)

INTEGER position 

Return the binary-reflected gray code for an 
integer. 

SUBROUTINE HWCLOCK(hwtime)

INTEGER hwtime(2) 

Place the current value of the hardware counter 
into a 64-bit unsigned integer variable. 

INTEGER FUNCTION INFOPID()

Return the process type of the process that sent 
a pending or received message. 

SUBROUTINE KILLCUBE(node, pid)

INTEGER node 

INTEGER pid 

Terminate and clear node process(es). 
Table A-52. Fortran Calls for iPSC® and Touchstone DELTA Compatibility (2 of 2)


Synopsis 

Description 

SUBROUTINE KILLPROC(node, pid)
INTEGER node
INTEGER pid

Terminate a node process.

SUBROUTINE LOAD(filename, node, pid)
CHARACTER filename*(*)
INTEGER node
INTEGER pid

Load a node process.

INTEGER FUNCTION MCLOCK()

Return the time in milliseconds.

SUBROUTINE MSGCANCEL(mid)
INTEGER mid

Cancel an asynchronous send or receive
operation.

INTEGER FUNCTION MYPART(rows, cols)
INTEGER rows
INTEGER cols

Obtain the height and width of the rectangle of
nodes allocated to the current application.

INTEGER FUNCTION MYPID()

Return the process type of the calling process.

INTEGER FUNCTION NODEDIM()

Return the dimension of the current application
(the number of nodes allocated to the
application is 2^dimension).

INTEGER FUNCTION RESTRICTVOL(unit, nvol, vollist)
INTEGER unit
INTEGER nvol
INTEGER vollist(*)

Return 0 (does nothing; provided for
compatibility only).
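
The binary-reflected gray code mapping that the GRAY and GINV entries in Table A-52 describe can be illustrated with a short C sketch. The function names gray_encode() and gray_decode() are hypothetical and are not the system calls themselves; the sketch only shows the standard mapping and its inverse, assuming 32-bit unsigned values.

```c
#include <stdint.h>

/* Binary-reflected Gray code of a position: the codes of adjacent
   positions differ in exactly one bit. (Illustrative name, not the
   Paragon gray() call.) */
uint32_t gray_encode(uint32_t position)
{
    return position ^ (position >> 1);
}

/* Inverse mapping: recover the position from its Gray code by XOR-ing
   together the code and all of its right shifts. (Illustrative name,
   not the Paragon ginv() call.) */
uint32_t gray_decode(uint32_t code)
{
    uint32_t position = code;
    for (uint32_t shift = code >> 1; shift != 0; shift >>= 1)
        position ^= shift;
    return position;
}
```

Applying gray_decode() to the result of gray_encode() returns the original position, which is the relationship the table states between the two calls.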
I/O Modes


Table A-53. Fortran Calls for I/O Modes 


Synopsis 

Description 

SUBROUTINE GOPEN(unit, path, iomode)
INTEGER unit
CHARACTER path*(*)
INTEGER iomode

Open a file on all nodes and set its I/O mode.

SUBROUTINE SETIOMODE(unit, iomode)
INTEGER unit
INTEGER iomode

Set the I/O mode for a file.

INTEGER FUNCTION IOMODE(unit)
INTEGER unit

Return the current I/O mode for a file.



Reading and Writing Files in Parallel 


Table A-54. Fortran Calls for Reading and Writing Files in Parallel (1 of 2) 


Synopsis 

Description 

SUBROUTINE CREAD(unit, buffer, nbytes)
INTEGER unit
INTEGER buffer(*)
INTEGER nbytes

Read from a file, waiting for completion.

SUBROUTINE CWRITE(unit, buffer,
nbytes)
INTEGER unit
INTEGER buffer(*)
INTEGER nbytes

Write to a file, waiting for completion.

SUBROUTINE CREADV(unit, iov, iovcnt)
INTEGER unit
INTEGER iov(*)
INTEGER iovcnt

Read from a file to irregularly-scattered
buffers, waiting for completion.
Table A-54. Fortran Calls for Reading and Writing Files in Parallel (2 of 2)


Synopsis 

Description 

SUBROUTINE CWRITEV(unit, iov, iovcnt)

INTEGER unit 

INTEGER iov(*) 

INTEGER iovcnt 

Write to a file from irregularly-scattered 
buffers, waiting for completion. 

INTEGER FUNCTION IREAD(unit, buffer,
nbytes)

INTEGER unit

INTEGER buffer(*)

INTEGER nbytes 

Read from a file without waiting for 
completion. 

INTEGER FUNCTION IWRITE(unit, buffer,
nbytes)

INTEGER unit

INTEGER buffer(*)

INTEGER nbytes

Write to a file without waiting for completion.

INTEGER FUNCTION IREADV(unit, iov,
iovcnt)

INTEGER unit

INTEGER iov(*)

INTEGER iovcnt 

Read from a file to irregularly-scattered 
buffers, without waiting for completion. 

INTEGER FUNCTION IWRITEV(unit, iov,
iovcnt) 

INTEGER unit 

INTEGER iov(*) 

INTEGER iovcnt 

Write to a file from irregularly-scattered 
buffers, without waiting for completion. 

INTEGER FUNCTION IODONE(id) 

INTEGER id

Determine whether an asynchronous I/O 
operation is complete. If complete, release the 
I/O ID. 

SUBROUTINE IOWAIT(id) 

INTEGER id 

Wait for completion of an asynchronous I/O 
operation and release the I/O ID. 
Detecting End-of-File and Moving the File Pointer


Table A-55. Fortran Calls for Detecting End-of-File and Moving the File Pointer 


Synopsis 

Description 

INTEGER FUNCTION ISEOF(unit)
INTEGER unit

Test for end-of-file.

INTEGER FUNCTION LSEEK(unit, offset,
whence)
INTEGER unit
INTEGER offset
INTEGER whence

Move the read/write file pointer.



Flushing Fortran Buffered I/O 


Table A-56. Fortran Calls for Flushing Buffered I/O 


Synopsis 

Description 

SUBROUTINE FORCEFLUSH()

Cause all buffered I/O to be flushed if an 
exception occurs. 

SUBROUTINE FORFLUSH(unit)

INTEGER unit 

Flush all buffered I/O on a particular unit. 


Increasing the Size of a File 


Table A-57. Fortran Calls for Increasing the Size of a File 


Synopsis 

Description 

INTEGER FUNCTION LSIZE(unit, offset,
whence) 

INTEGER unit 

INTEGER offset 

INTEGER whence 

Increase size of a file. 



Extended File Manipulation 


Table A-58. Fortran Calls for Extended File Manipulation 


Synopsis 

Description 

SUBROUTINE ESEEK(unit, offset, whence,
newpos)
INTEGER unit
INTEGER offset(2)
INTEGER whence
INTEGER newpos(2)

Move file pointer in extended file.

SUBROUTINE ESIZE(unit, offset, whence,
newsize)
INTEGER unit
INTEGER offset(2)
INTEGER whence
INTEGER newsize(2)

Increase size of extended file.
Performing Extended Arithmetic


Table A-59. Fortran Calls for Performing Extended Arithmetic 


Synopsis 

Description 

SUBROUTINE EADD(e1, e2, eresult)

INTEGER e1(2)

INTEGER e2(2)

INTEGER eresult(2)

Add two extended integers. 

INTEGER FUNCTION ECMP(e1, e2)

INTEGER e1(2)

INTEGER e2(2)

Compare two extended integers. 

SUBROUTINE EDIV(e, n, result) 

INTEGER e(2) 

INTEGER n 

INTEGER result 

Divide extended integer by integer. 

SUBROUTINE EMOD(e, n, result) 

INTEGER e(2) 

INTEGER n 

INTEGER result 

Give extended integer modulo an integer 
(remainder when e is divided by n). 

SUBROUTINE EMUL(e, n, eresult) 

INTEGER e(2) 

INTEGER n 

INTEGER eresult(2)

Multiply extended integer by integer. 

SUBROUTINE ESUB(e1, e2, eresult)

INTEGER e1(2)

INTEGER e2(2)

INTEGER eresult(2)

Subtract two extended integers. 

SUBROUTINE ETOS(e, s) 

INTEGER e(2) 

CHARACTER s(*) 

Convert extended integer to string. 

SUBROUTINE STOE(s, e)

CHARACTER s(*)

INTEGER e(2)

Convert string to extended integer. 
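
The extended integers in Table A-59 are declared as two-element INTEGER arrays. The C sketch below illustrates how addition and comparison of such two-word values might work. It is not the library's implementation: the ext_int type, the function names, the word order (low-order word first), and the unsigned interpretation are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical layout (an assumption, not taken from the manual):
   word[0] holds the low-order 32 bits, word[1] the high-order 32 bits. */
typedef struct {
    uint32_t word[2];
} ext_int;

/* Add two extended integers word by word, propagating the carry out of
   the low-order word into the high-order word. */
ext_int ext_add(ext_int a, ext_int b)
{
    ext_int sum;
    sum.word[0] = a.word[0] + b.word[0];
    /* Unsigned addition wraps, so a result smaller than an operand
       means the low-order word overflowed. */
    uint32_t carry = (sum.word[0] < a.word[0]) ? 1u : 0u;
    sum.word[1] = a.word[1] + b.word[1] + carry;
    return sum;
}

/* Three-way compare treating both values as unsigned: returns -1, 0,
   or 1. The return convention is an assumption for this sketch. */
int ext_cmp(ext_int a, ext_int b)
{
    if (a.word[1] != b.word[1])
        return a.word[1] < b.word[1] ? -1 : 1;
    if (a.word[0] != b.word[0])
        return a.word[0] < b.word[0] ? -1 : 1;
    return 0;
}
```

For example, adding 1 to a value whose low-order word is all ones clears that word and increments the high-order word.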
Introduction


This appendix gives you information you can use to port programs to Paragon™ OSF/1 from the
iPSC® series of supercomputers from Intel Supercomputer Systems Division.

This appendix lists the differences between iPSC system commands and system calls and those of 
Paragon™ OSF/1, and suggests alternatives that you can use for commands and calls that are not 
supported. Commands and calls that are not listed here should work the same in Paragon OSF/1 as 
they do in the iPSC system. 


General Compatibility Issues 

In general, iPSC system programs can simply be recompiled and executed on the Paragon system. 

However, keep in mind the following basic differences between the two systems: 

• There is no SRM. The Diagnostic Station is used only for system administration; all software 
development is done either on remote workstations or on the Paragon system itself. Parallel 
applications are run only on the Paragon system. 

• Host programs are not directly supported. See “Host Calls” on page B-9 for more information. 

• The node network is a 2-D mesh rather than a hypercube. You might want to change the data 
distribution in your application to take advantage of the different system topology. 

• An application can run on any number of nodes from 1 to the size of the compute partition (up
to several thousand nodes). If your application depends on the number of nodes being a power
of two or no greater than 128, you should rewrite it so that it works on any number of nodes. If
this is not possible, you should have the application print an error message if numnodes() is not
a power of two or is too large for the application to handle.
• If a message arrives at a node before the receive for the message has been posted, the message
is stored in a system buffer. In the iPSC system, the space available for these system buffers is 
the entire free physical memory of the node. In the Paragon system, this space is more limited 
(1M bytes by default). This limitation results from the fact that the Paragon system supports 
multiple processes per node. 


NOTE 

Because of this limitation, iPSC system applications that use large 
amounts of system message buffering may slow down or hang on 
the Paragon system, especially when run on large numbers of 
nodes. 


If this occurs, you can increase the system message buffering space with the -mbf switch, as 
described under “System Message Buffers” on page 8-16. However, it would be better to 
re-write the application so that receives are always posted before the message arrives, as 
discussed under “Avoid Message Buffering” on page 8-11. 

• The term process ID, or PID, is used differently. In the iPSC system, each process has a UNIX 
PID used by the OS and an NX PID used for message passing. In the Paragon system, the 
“UNIX PID” is just called the PID, and the “NX PID” is called the process type or ptype. 
Although the names have changed, the software works the same. For example, mypid() and
infopid() are supported as equivalents to myptype() and infoptype(). Exception: on the
iPSC/860, the NX PID is always 0; in the Paragon system, the process type can be any integer
from 0 to 2,147,483,647 (2^31 - 1) inclusive (but is usually 0).

• Force types (special message types that use a limited flow control technique) are fully supported 
and work the same as they do in the iPSC system. However, in the Paragon system regular 
messages are just as fast as force type messages, so force types are not needed for performance. 
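
The node-count check suggested in the list above (print an error if the number of nodes is not a power of two or is too large) can be sketched in C as follows. This is a minimal illustration: the function names and the MAX_NODES limit are hypothetical, and the node count is passed in as a parameter so that the sketch stands alone. In a real application the value would come from numnodes().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical upper bound that this application can handle
   (an assumption for illustration only). */
#define MAX_NODES 128L

/* True when n is a positive power of two: such a value has exactly one
   bit set, so clearing its lowest set bit with n & (n - 1) leaves zero. */
bool is_power_of_two(long n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/* Returns 0 if the node count is usable, or -1 after printing an error.
   In a real program the argument would be the result of numnodes(). */
int check_node_count(long nodes)
{
    if (!is_power_of_two(nodes)) {
        fprintf(stderr, "error: %ld nodes is not a power of two\n", nodes);
        return -1;
    }
    if (nodes > MAX_NODES) {
        fprintf(stderr, "error: %ld nodes exceeds the limit of %ld\n",
                nodes, MAX_NODES);
        return -1;
    }
    return 0;
}
```

An application would call check_node_count() once at startup and exit if it returns a nonzero value.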


New Features 

Paragon OSF/1 offers the following features that were not available on the iPSC system. You can
use these features to improve the performance and readability of your programs.

• You can use the complete set of OSF/1 commands on the Paragon system, as discussed in 
Chapter 2. 

• You can execute an application on multiple nodes just by typing its name on the command line, 
using command-line switches to control its execution, as discussed under “Running 
Applications” on page 2-11. 

• You can control the values of some important message-passing configuration parameters, as 
discussed under “Message-Passing Configuration Switches” on page 8-18. 
• You can allocate groups of nodes of any size and shape, and control the scheduling
characteristics of applications that run in them, as discussed under “Managing Partitions” on 
page 2-25 and “Managing Partitions” on page 4-27. 

• You can have more than one process per node, as discussed under “Process Characteristics” on 
page 3-3. When sending messages, you specify a process by its process type (equivalent to the 
“NX PID” in the iPSC system). 

• You can tell the system to discard an asynchronous message ID as soon as the send or receive
completes with msgignore(), as discussed under “Asynchronous Send and Receive” on page
3-10.

• You can merge together a number of asynchronous message-passing requests and wait for all
of them to complete in a single call with msgmerge(), as discussed under “Merging Message
IDs” on page 3-13.

• You can pass a parameter to a message interrupt handler with hsendx(), as discussed under
“Treating a Message as an Interrupt” on page 3-18.

• You can receive or probe for a message based on its sender, and receive information about a
message along with the message, with the ...x() calls, as discussed under “Extended Receive and
Probe” on page 3-24.

• You can use system calls to control the execution characteristics of parallel programs, as 
discussed under “Managing Applications” on page 4-2. 

• You can open a file on all nodes at once very efficiently with gopen(), as discussed under
“Opening Files in Parallel” on page 5-9.

• You can read the same data from a file into all nodes at the same time very efficiently with the 
I/O mode M_GLOBAL, as discussed under “Using I/O Modes” on page 5-13. 

• You can read data into or write data from a series of scattered memory buffers with the
...readv() and ...writev() calls, as discussed under “Reading and Writing Files in Parallel” on
page 5-24.

• You can find out the characteristics of PFS file systems (which are more configurable than CFS)
with the getpfsinfo() and statpfs() calls, as discussed under “Getting Information About PFS
File Systems” on page 5-39.

• You can use the HIPPI and FDDI network interfaces, as discussed in the Paragon™
High-Performance Parallel Interface Manual and Paragon™ Fiber Distributed Data Interface
Installation and Configuration Guide.

• You can use the Paragon application development tools to help you port and optimize your 
code, as discussed in the Paragon™ Application Tools User’s Guide. 
Compilers

The Paragon OSF/1 compilers work the same as the iPSC system compilers, with the following
exceptions:

• The compilers, linker, and other tools are now available on the Paragon system as well as on
workstations. They can be called by the standard names (cc, f77, ld, and so on) as well as the
names used in cross-development (icc, if77, ld860, and so on).

• The environment variable that specifies the root of the compiler directory tree is called
PARAGON_XDEV rather than IPSC_XDEV. The default for this variable is now
/usr/paragon/XDEV rather than /usr/ipsc/XDEV.

• The compiler files are now found in the directory $PARAGON_XDEV/paragon rather than
$IPSC_XDEV/i860. For example, your execution search path (path or PATH environment
variable) should include the directory $PARAGON_XDEV/paragon/bin.arch (where arch
identifies the architecture of the system, such as paragon or sun4) rather than
$IPSC_XDEV/i860/bin.arch or $IPSC_XDEV/i860/bin.

• The -p switch is now ignored. See the Paragon™ Application Tools User’s Guide for
information on profiling. 

• The default for quad-alignment has been changed from -Mnoquad to -Mquad. This change 
results in up to four times better performance for some code. 

• The new switch -nx has been added. This switch generates a program that automatically starts 
itself on multiple nodes, as discussed under “Compiling and Linking Applications” on page 2-5. 
The switch -node is currently accepted as a synonym for -nx, but this support may be dropped 
in a future release. 

• You can now have a file called .icfrc in your home directory that defines the default compiler 
switches for you. 

See the Paragon™ Fortran Compiler User’s Guide or Paragon™ C Compiler User’s Guide for more
information on the Paragon OSF/1 compilers.


NOTE 

You cannot use the Paragon OSF/1 cross-compilers to produce 
programs for the iPSC system, and you cannot use the iPSC 
system cross-compilers to produce programs for Paragon OSF/1. 


If you develop programs for the iPSC system as well as for Paragon OSF/1, you must be sure that your
execution search path (PATH or path variable) is set appropriately for your current target system. To
compile a program for Paragon OSF/1, the variable PARAGON_XDEV must be set appropriately
and your execution search path must include $PARAGON_XDEV/paragon/bin.arch; to compile a
program for the iPSC system, the variable IPSC_XDEV must be set appropriately and your execution
search path must include $IPSC_XDEV/i860/bin.arch instead. Be sure that your execution search
path does not include both these directories at the same time.
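
The rule that the search path must not contain both cross-development directories can be checked mechanically. The C sketch below is one way to do so, assuming a colon-separated search path string. The function names are hypothetical, and the caller supplies the two directory names, for example after expanding $PARAGON_XDEV and $IPSC_XDEV for the local architecture.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when dir appears as a complete colon-separated entry of path. */
bool path_contains(const char *path, const char *dir)
{
    size_t len = strlen(dir);
    const char *p = path;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t entry_len = end ? (size_t)(end - p) : strlen(p);
        if (entry_len == len && strncmp(p, dir, len) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;   /* step past the separator to the next entry */
    }
    return false;
}

/* True when the search path mixes both cross-development trees, which
   is the configuration the text warns against. */
bool path_mixes_toolchains(const char *path,
                           const char *paragon_bin, const char *ipsc_bin)
{
    return path_contains(path, paragon_bin) && path_contains(path, ipsc_bin);
}
```

A build script or wrapper program could pass the value of getenv("PATH") to path_mixes_toolchains() and refuse to compile when it returns true.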


Commands 

In general, all of the standard commands of UNIX System V are supported by Paragon OSF/1, but 
none of the iPSC-system-specific commands are supported. However, many of these commands are 
not needed in Paragon OSF/1, or have equivalent standard commands in OSF/1. 


Cube Control Commands 

The usage model of Paragon OSF/1 is different from that of the iPSC system. Instead of allocating 
a cube with a certain number of nodes, loading a program onto the cube, and then releasing the cube, 
you run a parallel application simply by typing its name on the Paragon OSF/1 command line. You 
can use command-line arguments to control its execution characteristics (such as the number of 
nodes on which it runs), and you can use standard OSF/1 process control commands such as kill to 
control the program. (See Chapter 2 for more information on running and controlling applications in 
Paragon OSF/1.) 

For this reason, the following iPSC system commands, which create and control cubes, are not 
supported in Paragon OSF/1: 

archcube    This command is not needed in Paragon OSF/1 because all nodes currently
have the same architecture.

attachcube    This command is not needed in Paragon OSF/1 because you do not have to
attach to a cube before you can use it.

cubeinfo    Use the lspart command to list the available partitions. See “Listing
Subpartitions” on page 2-49 for more information.

getcube    Use the -sz switch on the application command line to specify the number of
nodes allocated to the application. See “Specifying Application Size” on page
2-15 for more information.

The mkpart command is similar to getcube in that it allocates a partition (a 
group of nodes). However, partitions are not the same as cubes: partitions can 
overlap, and a partition can be used by several applications at once. 
Depending on the policies of your site, you may or may not be allowed to 
allocate partitions. See “Making Partitions” on page 2-39 for more 
information.

killcube    Use the OSF/1 kill command to kill a running application, or press your
interrupt key (<ctrl-c> or <Del>). See “Managing Running
Applications” on page 2-23 for more information.

load    Type an application’s filename on the command line to run it on multiple
nodes. See “Running Applications” on page 2-11 for more information.

newserver    This command is not needed in Paragon OSF/1 because you can use the usual
OSF/1 I/O redirection characters to redirect an application’s output. See “I/O
Redirection” on page 2-12 for more information.

relcube    This command is not needed in Paragon OSF/1 because you do not have to
release a cube once you have used it. The nodes allocated to an application
are automatically released when all the processes in the application have
terminated.

The rmpart command is similar to relcube in that it deallocates a partition
(a group of nodes). However, partitions are not the same as cubes: partitions
can overlap, and a partition can be used by several applications at once.
Depending on the policies of your site, you may or may not be allowed to
remove partitions. See “Removing Partitions” on page 2-45 for more
information.

startcube    This command has no equivalent in Paragon OSF/1. There is no way to load
an application into the nodes’ memory without starting it.

syslog    This command is not needed in Paragon OSF/1 because you can use the usual
OSF/1 I/O redirection characters to redirect an application’s output. The
standard I/O of a node process is connected to the same files or devices as the
standard I/O of its controlling process. See “I/O Redirection” on page 2-12
for more information.

waitcube    This command is not needed in Paragon OSF/1 because, by default, your
command prompt does not return until the application has completed. Also,
you can redirect the output of any program with the usual OSF/1 I/O
redirection characters (see “I/O Redirection” on page 2-12 for more
information).


CFS Commands

The following iPSC system commands, which control the Concurrent File System and the SRM tape 
drive, are not supported in Paragon OSF/1: 

cptape    Use the cpio command instead. See cpio in the OSF/1 Command Reference
for more information.

showvol    Use the showfs command instead. See “Displaying File System Attributes”
on page 5-5 for more information.

star    Use the tar command instead. (Note that you must use the -E switch to
archive a file larger than 2G-1 bytes.) See tar in the Paragon™ Commands
Reference Manual and OSF/1 Command Reference for more information.

stream    This command is not needed in Paragon OSF/1 because there is no streaming
tape drive.

tapemode    This command currently has no equivalent in Paragon OSF/1. There is no
way to display or change the operating mode of the system’s tape drives.


System Administration Commands 

The following iPSC system commands, which are used for system administration, are not supported 
in Paragon OSF/1: 

cbackup    Use the dump command instead. See dump in the OSF/1 System and
Network Administrator’s Reference for more information.

cfschk    Use the fsck command instead. See fsck in the OSF/1 System and Network
Administrator’s Reference for more information.

crestore    Use the rdump command instead. See rdump in the OSF/1 System and
Network Administrator’s Reference for more information.

makewhatis    Use the catman command instead. See catman in the OSF/1 Command
Reference for more information.

mkcfs    Use the newfs command instead. See newfs in the OSF/1 System and
Network Administrator’s Reference for more information.

mkdev    Use the mknod command instead. See mknod in the OSF/1 System and
Network Administrator’s Reference for more information.

plogon and plogoff

These commands currently have no equivalent in Paragon OSF/1. There is
currently no way to log creation and deletion of partitions or running of
applications. However, you can use the syslogd daemon to log other system
activity. See syslogd in the OSF/1 System and Network Administrator’s
Reference for more information.


Remote Host Commands 

The following iPSC system commands, which are used for program development on remote hosts, 
are not supported in Paragon OSF/1: 

rf77    Use the if77 command instead. See the Paragon™ Fortran Compiler User’s
Guide for more information.

rcc    Use the icc command instead. See the Paragon™ C Compiler User’s Guide
for more information.

rld    Use the ld860 command instead. See the Paragon™ Fortran Compiler User’s
Guide or Paragon™ C Compiler User’s Guide for more information.

ras    Use the as860 command instead. See the Paragon™ Fortran Compiler
User’s Guide or Paragon™ C Compiler User’s Guide for more information.

rar    Use the ar860 command instead. See the Paragon™ Fortran Compiler
User’s Guide or Paragon™ C Compiler User’s Guide for more information.


Miscellaneous Commands 

The following iPSC system commands are not supported in Paragon OSF/1: 

less    Use the more command instead. See more in the OSF/1 Command Reference
for more information.

manpath    Use the MANPATH environment variable instead. See man in the OSF/1
Command Reference for more information.

nsh    Use the rlogin or telnet command to log into the Paragon system from your
workstation. See rlogin or telnet in your workstation’s documentation for
more information.

rebootcube    This command has no equivalent in Paragon OSF/1. There is no way for
ordinary users to reboot the system.


System Calls 

In general, all of the standard system calls of UNIX System V and most of the iPSC-system-specific
system calls are supported by Paragon OSF/1. This section suggests alternatives for the unsupported
calls.

NOTE

Some iPSC calls are provided for backward compatibility only, and
are not intended for use in new programs. These calls are not
documented in the online manpages or in the Paragon™ C System
Calls Reference Manual or Paragon™ Fortran System Calls
Reference Manual. See “iPSC® and Touchstone DELTA
Compatibility Calls” on page 4-52 for a list of these calls.


Include Files 

Paragon OSF/1 does not support the iPSC system include files <cube.h> or <fcube.h>. You should 
replace any reference to <cube.h> with <nx.h>, and any reference to <fcube.h> with <fnx.h>. 


Host Calls 


Applications in Paragon OSF/1 do not usually have host programs. The usual programming model 
in Paragon OSF/1 is to write a single program (which corresponds to a “node program” in the iPSC 
system), link it with -nx, and execute the program on a group of nodes by typing its name (see 
“Running Applications” on page 2-11 for more information). You may be able to eliminate all 
references to the following unsupported calls by rewriting your program to use this programming 
model. If your application requires a separate host program, you can rewrite your host program into 
a controlling process (see “Managing Applications” on page 4-2 for more information).

For this reason, the -host switch to the cc and f77 commands is not supported (there is no separate
host library; host programs use the same library as node programs). Also, the following iPSC system
calls, which are used in host programs, are not supported in Paragon OSF/1:

attachcube()    This call currently has no equivalent in Paragon OSF/1. Unlike a host
program, a controlling process cannot be associated with more than one
application. Consider rewriting your host program as two or more separate
programs, each of which creates one application and communicates with the
other host program(s) using pipes, signals, or some other OSF/1 interprocess
communication method. See “Managing Applications” on page 4-2 for
information on creating and controlling applications using system calls.

cubeinfo()    This call currently has no equivalent in Paragon OSF/1. However, because
allocation of nodes in Paragon OSF/1 is not exclusive, it is not usually
necessary for programs to know how other users have allocated nodes. To get
information on your own application (equivalent to the “current cube”), you
can use calls such as numnodes().

getcube()    Use nx_initve() instead. See “Managing Applications” on page 4-2 for
information on nx_initve().

killcube()    This call is supported, but can only be used to kill and flush all processes on
all nodes (killcube(-1,-1)).

You can use kill() to kill a single process, as discussed for killproc() below.

killproc()    This call is supported, but can only be used to kill all processes on all nodes
(killproc(-1,-1)).

You can use kill() to kill a single process, given its OSF/1 process ID. kill()
is supported in both C and Fortran. To determine the OSF/1 process ID of a
process created by nx_nfork(), nx_load(), or nx_loadve(), use the values
stored into the pid_array argument. These calls store the OSF/1 PIDs of the
processes created into the elements of this array, as discussed under “Using
PIDs” on page 4-14.

For example, to kill the process on node number node:

#include <signal.h> 

n = nx_nfork(NULL, -1, ptype, pid_array); 


kill(pid_array[node], SIGKILL);

Note that process types (ptype in this example) in Paragon OSF/1 are
equivalent to NX PIDs in the iPSC system. PIDs (pid_array in this example)
in Paragon OSF/1 are standard UNIX process IDs.

See the OSF/1 Programmer’s Reference for information on kill(); see
“Managing Applications” on page 4-2 for information on nx_nfork(),
nx_load(), and nx_loadve().

killsyslog()    Use freopen() instead, to close the standard output and standard error output
and reopen them to /dev/tty. See freopen() in the OSF/1 Programmer’s
Reference for more information.

freopen() is not currently supported for Fortran programs. However, it is
supported for C programs. You can write a C “wrapper” function, as follows:

#include <stdio.h>

void killsyslog_() {
    freopen("/dev/tty", "w", stdout);
    freopen("/dev/tty", "w", stderr);
}

Note the underscore at the end of the function name. Once you have compiled
this function and linked it into your Fortran program, you can call killsyslog()
as described in the iPSC system documentation.

newserver()    This call is not necessary in Paragon OSF/1. The standard I/O of a controlling
process (host process) is connected to the same files or devices as the standard
I/O of its node processes.

relcube()    This call is not necessary in Paragon OSF/1. The nodes allocated to an
application are automatically released when all the processes in the
application have terminated.

setpid()    Use setptype() instead. See “Process Characteristics” on page 3-3 for information
on setptype(), and “Message Passing Between Controlling Process and
Application Processes” on page 4-25 for information on using setptype() in
a controlling process.

setsyslog()    This call is not necessary in Paragon OSF/1. The standard I/O of a controlling
process (host process) is connected to the same files or devices as the standard
I/O of its node processes.

waitall()


To wait for all processes on all nodes (waitall(-1, -1)), call nx_waitall(). See
“Waiting for Application Processes with nx_waitall()” on page 4-14 for more
information.

To wait for a single node process (waitall(node, pid)), use the OSF/1 system
call waitpid() to wait for the process with a particular OSF/1 process ID. To
determine the PID of a process created by nx_nfork(), nx_load(), or
nx_loadve(), use the values stored into the pid_array argument. These calls
store the OSF/1 PIDs of the processes created into the elements of this array,
as discussed under “Using PIDs” on page 4-14.

For example, to wait for the process on node number node:

n = nx_nfork(NULL, -1, ptype, pid_array);


waitpid(pid_array[node], &status, 0);

Note that process types (ptype in this example) in Paragon OSF/1 are
equivalent to NX PIDs in the iPSC system. PIDs (pid_array in this example)
in Paragon OSF/1 are standard UNIX process IDs.

See the OSF/1 Programmer’s Reference for information on wait() and
waitpid(); see “Managing Applications” on page 4-2 for information on
nx_nfork(), nx_load(), and nx_loadve().

wait() is supported in both C and Fortran, but waitpid() is not currently
supported in Fortran. You can make waitpid() callable from Fortran by
writing a C “wrapper” function, as follows:

#include <sys/types.h> 

#include <sys/wait.h> 

int waitpid_(int *process_id,
             int *status_location,
             int *options) {
    return((int)waitpid((pid_t)*process_id,
                        status_location,
                        *options));
}

Note the underscore at the end of the function name. Once you have compiled
this file and linked it into your Fortran program, you can call waitpid() as
described in the OSF/1 Programmer’s Reference. The wrapper function
waitpid() takes three integer*4 parameters and returns an integer*4 value.

waitone()    To wait for the first node process in the entire application to complete
(waitone(-1, -1, cnode, cpid, ccode)), use the OSF/1 system call wait(). For
example:

n = nx_nfork(nodes, NUMNODES, ptype, pids); 


pid = wait(&status);

After this call, the status of the first process to complete is stored in status and
its OSF/1 process ID is stored in pid. To determine the process’s node
number, look for the value of pid in the pids array returned by nx_nfork(),
nx_load(), or nx_loadve().

To wait for a single node process (waitone(node, pid, cnode, cpid, ccode)),
use the same technique described for waitall(node, pid):

n = nx_nfork(NULL, -1, ptype, pid_array);


pid = waitpid(pid_array[node], &status, 0);

In this case, the status of the process is stored in status and its OSF/1 process
ID is stored in pid. To determine the process’s node number, look for the
value of pid in pid_array as described above.

See the OSF/1 Programmer’s Reference for information on wait() and
waitpid(); see “Managing Applications” on page 4-2 for information on
nx_nfork(), nx_load(), and nx_loadve(). wait() is supported in both C and
Fortran, but waitpid() is not; to call waitpid() from Fortran, use the technique
discussed previously under waitall().
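
The lookup step described for waitone() (finding the node number that corresponds to the PID returned by wait()) can be sketched as follows. This is an illustration only: the helper names node_of_pid() and wait_first_node() are hypothetical, and the array of PIDs is assumed to have been filled in the way nx_nfork() fills pid_array.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the index ("node number") of pid in pids[0..count-1],
 * or -1 if pid is not in the array. */
static int node_of_pid(const pid_t *pids, int count, pid_t pid)
{
    int i;

    for (i = 0; i < count; i++) {
        if (pids[i] == pid)
            return i;
    }
    return -1;
}

/* Wait for the first child process to complete. Its status is stored in
 * *status; the return value is its node number, or -1 if wait() fails or
 * the PID is not found in pids. */
static int wait_first_node(const pid_t *pids, int count, int *status)
{
    pid_t pid = wait(status);

    if (pid < 0)
        return -1;
    return node_of_pid(pids, count, pid);
}
```

A linear search is adequate here because the array has one entry per node and the lookup happens once per completed process.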


Byte-Swapping Calls

The calls listed in Table B-1, which swap bytes between the format used on the cube and the format 
used on some remote hosts, are not supported in the current release of Paragon OSF/1. 


Table B-1. Unsupported iPSC® System Byte-Swapping Calls

createstruc()    CTOHF()    HTOCC()    HTOCL()
CTOHC()          CTOHL()    HTOCD()    HTOCS()
CTOHD()          CTOHS()    HTOCF()    relstruc()


You can use the standard OSF/1 system calls htonl(), htons(), ntohl(), and ntohs() to swap bytes
between the standard format for your machine and the Internet network format. See htonl(), htons(),
ntohl(), and ntohs() in the OSF/1 Programmer’s Reference for more information.

htonl(), htons(), ntohl(), and ntohs() are not currently supported for Fortran programs. However,
they are supported for C programs. You can make them callable from Fortran by writing C
“wrapper” functions, as follows:

#include <netinet/in.h> 

long htonl_(long *hostlong) {
    return((long)htonl((unsigned long)*hostlong));
}

short htons_(short *hostshort) {
    return((short)htons((unsigned short)*hostshort));
}

long ntohl_(long *netlong) {
    return((long)ntohl((unsigned long)*netlong));
}

short ntohs_(short *netshort) {
    return((short)ntohs((unsigned short)*netshort));
}

Note the underscore at the end of each function name. Once you have compiled this file and linked
it into your Fortran program, you can call htonl(), htons(), ntohl(), and ntohs() as described in the
OSF/1 Programmer’s Reference. The wrapper functions htonl() and ntohl() take an integer*4
parameter and return an integer*4 value; the wrapper functions htons() and ntohs() take an
integer*2 parameter and return an integer*2 value.
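
To illustrate what these calls do from C, the following minimal sketch stores a 32-bit value in network byte order and reads it back. It assumes a POSIX system where htonl() and ntohl() are declared in <arpa/inet.h>; the helper names put_u32_network() and get_u32_network() are hypothetical and are not part of any iPSC or Paragon library.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Store value in buf in network (big-endian) byte order. */
static void put_u32_network(uint32_t value, unsigned char buf[4])
{
    uint32_t net = htonl(value);   /* host order -> network order */

    memcpy(buf, &net, sizeof net);
}

/* Read a 32-bit value that was stored in network byte order. */
static uint32_t get_u32_network(const unsigned char buf[4])
{
    uint32_t net;

    memcpy(&net, buf, sizeof net);
    return ntohl(net);             /* network order -> host order */
}
```

Because the conversion always goes through network order, the bytes in buf are the same regardless of the byte order of the machine that wrote them.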


Floating-Point Control Calls

The Paragon OSF/1 C system calls fpgetsticky() and fpsetsticky(), which get and set the i860
microprocessor’s floating-point exception sticky flags, and fpgetmask() and fpsetmask(), which get
and set the floating-point exception mask, do not support the exception value FP_X_DNML, which
represents a denormalization exception in the iPSC system.

The Paragon OSF/1 Fortran system call fpsetmask() also does not support the denormalization
exception, and uses different numeric values to represent the various exceptions than the
corresponding iPSC system call. See “Controlling Floating-Point Behavior” on page 4-46 for the
correct values for Paragon OSF/1.


CFS Calls 


In Paragon OSF/1, the iPSC system’s Concurrent File System (CFS) has been replaced by the
Parallel File System (PFS). PFS calls are compatible with CFS calls; however, PFS offers additional
functionality (see Chapter 5 for more information). This section lists the differences that may affect
some programs that use CFS calls.

iread() and iwrite()

These calls work in Paragon OSF/1 just as they do in the iPSC system. In both
systems, the number of I/O IDs is limited; however, the limit in Paragon
OSF/1 is much smaller than in the iPSC system. (In the iPSC system the limit
is 5000; in Paragon OSF/1 it is at least 256, but may vary from release to
release.) For this reason, it is very important that you use iodone() or iowait()
to release each ID as soon as possible after you use it. If you program in C,
you can use _iread() or _iwrite() to detect the “too many requests” error
(EQNOMID).

open()    Many iPSC system programs use code like the following to open a file on all
nodes:

if(mynode() == 0) {
    fd = open("myfile", O_CREAT | O_RDWR, 0644);
    gsync();
} else {
    gsync();
    fd = open("myfile", O_RDWR, 0644);
}
setiomode(fd, iomode);

The open() call works the same in Paragon OSF/1 as it does in the iPSC
system. However, if this code is executed on many nodes, the large number
of open() requests arriving simultaneously at the I/O node can cause the I/O
node to slow down, hang, or even crash. In the worst case, this can crash the
whole system.

You should always use the gopen() call instead of this type of code. For
example, you should replace the lines shown above with the following:

fd = gopen("myfile", O_CREAT | O_RDWR,
           iomode, 0644);

gopen() opens a file simultaneously on all nodes and sets its I/O mode in a
single operation. It is much more efficient than having each node call open(),
and avoids this type of system crash completely.

Note that gopen() opens the same file on each node. If each node is opening
its own file, you must still use open(). However, you should try to avoid using
open() together with gsync(), to prevent all the open() requests from arriving
at the I/O node at the same time.


Miscellaneous Calls 


The following iPSC system calls work differently or are not supported in Paragon OSF/1: 


dclock()    This call works in Paragon OSF/1 just as it does in the iPSC system: it returns
the time since the system was booted, in seconds. However, in a
gang-scheduled partition your application may be rolled out and then rolled
in again. While it is rolled out, the application is stopped but the dclock()
clock keeps going (reflecting “wall-clock” time), which means that dclock()
cannot be used to determine the amount of time your application has actually
been running.

Using dclock() in a gang-scheduled partition may result in incorrect
MFLOPS estimates. You can use the time command, the getrusage() system call
(C only), or the etime() or dtime() routine (Fortran only) instead to determine
your application’s CPU usage. See the OSF/1 Command Reference for
information on time, the OSF/1 Programmer’s Reference for information on
getrusage(), and the Paragon™ Fortran Compiler User’s Guide for
information on etime() and dtime().
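
As a rough illustration of the getrusage() alternative in C, the following sketch returns the CPU time (user plus system) consumed by the calling process. It assumes the standard getrusage() interface with RUSAGE_SELF; the helper name cpu_seconds() is hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Return the CPU time (user + system) used so far by the calling
 * process, in seconds, or -1.0 if getrusage() fails. Unlike a
 * wall-clock timer, this does not advance while the process is
 * not running. */
static double cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
         + (double)(ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}
```

To time a computation, call cpu_seconds() before and after it and subtract the two values.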

flushmsg()    This call currently has no equivalent in Paragon OSF/1. It may be supported
in a future release.

getiphosts()    This call currently has no equivalent in Paragon OSF/1. However, because the
OSF/1 operating system automatically routes network traffic using all
available Ethernet ports, it is not usually necessary to know the network
names of the available ports.

gixor()    This call is not supported in Paragon OSF/1. The exclusive OR operator is not
associative, and gives unpredictable results when used on more than two
nodes.

glxor()    This call is not supported in Paragon OSF/1. The exclusive OR operator is not
associative, and gives unpredictable results when used on more than two
nodes.

handler()    Use the signal() system call instead (signal() is supported for both C and
Fortran). See signal() in the OSF/1 Programmer’s Reference for information
on signal handling; see signal() in the Paragon™ Fortran Compiler User’s
Guide for information on the Fortran interface to signal().
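
A minimal sketch of installing a handler with signal() in C is shown below. The choice of SIGUSR1 and the names usr1_seen, on_usr1(), and install_usr1_handler() are assumptions made for illustration; they do not correspond to anything in the iPSC handler() interface.

```c
#include <signal.h>

/* Set to 1 by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t usr1_seen = 0;

/* Signal handler: only records that the signal arrived. */
static void on_usr1(int signo)
{
    (void)signo;
    usr1_seen = 1;
}

/* Install on_usr1() as the handler for SIGUSR1.
 * Returns 0 on success, or -1 if signal() fails. */
static int install_usr1_handler(void)
{
    return signal(SIGUSR1, on_usr1) == SIG_ERR ? -1 : 0;
}
```

Keeping the handler down to a single flag assignment avoids calling functions that are not safe to use inside a signal handler; the main program can test the flag at a convenient point.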

plogon() and plogoff()

These calls currently have no equivalent in Paragon OSF/1. There is currently
no way to automatically log creation and deletion of partitions or running of
applications. However, you can use the syslog() call to log activities under
program control. See syslog() in the OSF/1 Programmer’s Reference for
more information.

setiphost()    This call is not necessary in Paragon OSF/1. The OSF/1 operating system
automatically routes network traffic using all available Ethernet ports; it is not
necessary to select one port to perform network operations.

setpgrp()    There are two different versions of this call in Paragon OSF/1. The standard
version of setpgrp(), found in libbsd.a, is equivalent to setpgid() and is not
compatible with the iPSC/860 version. The System V version of setpgrp(),
found in libsys5.a, is equivalent to setsid() and is compatible with the
iPSC/860 version. To get the iPSC/860-compatible version, be sure to use the
switch -lsys5 when linking.


Summary 


Table B-2 summarizes the Paragon OSF/1 equivalents for the unsupported iPSC system commands. 

Table B-2. Summary of Unsupported iPSC® System Commands

iPSC® System Command    Paragon™ OSF/1 Equivalent

archcube                (none)
attachcube              (none)
cbackup                 dump
cfschk                  fsck
cptape                  cpio
crestore                rdump
cubeinfo                lspart
getcube                 -sz switch on application command line
killcube                kill
less                    more
load                    Application’s filename
makewhatis              catman
manpath                 MANPATH environment variable
mkcfs                   newfs
mkdev                   mknod
newserver               I/O redirection characters
nsh                     rlogin or telnet
plogoff                 (none)
plogon                  (none)
rar                     ar860
ras                     as860
rcc                     icc
rebootcube              (none)
relcube                 (none)
rf77                    if77
rld                     ld860
showvol                 showfs
star                    tar
startcube               (none)
stream                  (none)
syslog                  I/O redirection characters
tapemode                (none)
waitcube                (none)

Table B-3 summarizes the Paragon OSF/1 equivalents for the unsupported iPSC system calls.

Table B-3. Summary of Unsupported iPSC® System Calls

iPSC® System Call           Paragon™ OSF/1 Equivalent

attachcube()                (none)
cubeinfo()                  (none)
dclock()                    Supported, but use getrusage() (C) or
                            etime()/dtime() (Fortran) to determine CPU
                            usage in gang-scheduled partitions.
fpgetsticky(), fpsetsticky(),
fpgetmask(), fpsetmask()    Supported, except for FP_X_DNML, but
                            Fortran mask values are different.
flushmsg()                  (none)
getcube()                   nx_initve()
getiphosts()                (none)
gixor()                     (none)
glxor()                     (none)
handler()                   signal()
iread(), iwrite()           Supported, but number of I/O IDs is much
                            smaller.
killcube()                  Use killcube(-1,-1) to kill and flush all
                            processes; use kill() to kill one process
killproc()                  Use killproc(-1,-1) to kill all processes; use
                            kill() to kill one process
killsyslog()                freopen()
newserver()                 (none)
open()                      Supported, but use gopen() instead if possible.
plogoff()                   (none)
plogon()                    (none)
relcube()                   (none)
setiphost()                 (none)
setpgrp()                   Supported, but be sure to link with -lsys5 to get
                            the correct version.
setpid()                    setptype()
setsyslog()                 (none)
waitall()                   Use nx_waitall() to wait for all processes; use
                            wait() or waitpid() to wait for one process
waitone()                   wait() or waitpid()
Byte-swapping calls         htonl(), htons(), ntohl(), and ntohs()


### in filenames 5-31

. (dot) in partition pathnames 2-28 

. (root) partition 2-26 

.compute partition 2-28 

.F extension 2-9 

.service partition 2-28 

/cfs directory 5-13 

/dev/io0/rmt6 device 5-46 

/pfs file system 5-5 

/usr/ccs/lib directory 2-8 

/usr/include directory 2-8

/usr/lib directory 2-8 

/usr/paragon/XDEV directory 2-9 

/usr/tmp directory 5-13 

\;file (second program in an application) 2-21 

_NODE preprocessor symbol 2-5 

_cread() system call 5-26 
_creadv() system call 5-26
_crecv() system call 4-42 
_cwrite() system call 5-26 
_cwritev() system call 5-26 


_r calls 6-10 

_REENTRANT preprocessor symbol 6-5,6-41
-1 

as error return 4-42 
as message type 3-6 
as node number 3-3,4-25 
as process type 3-4 
as sending node number 3-24 
as sending process type 3-24 

64-bit integers 5-37 

A 

absolute partition pathname 2-28 

access methods 5-13 
synchronization of 5-48 

access() system call 5-31 

accessing contiguous memory locations 8-5 

active and inactive applications 2-35 

address space 1-5 

algorithms 8-1 

aligning application buffers 8-12 
aligning I/O buffers 8-25 
ALLOCATE statement 8-8,8-15 
allocating memory 8-8
allocating nodes to a partition 2-30,2-40

allocating nodes to an application 2-15,4-4 

allocating space to a file 5-7,5-32 

allocator configuration parameters 2-36 

alternate node topologies 7-6 

anonymous files 5-13 

applicable documents vii 

application buffers 
aligning 8-12 

"application" command 2-13, A-2 


applications 1-1,2-1 

active and inactive 2-35 

allocating nodes to an application 4-4 

Bourne shell 2-24 

compiling and linking 2-5 

compiling, linking, and executing 2-3 

contiguous nodes 2-16,4-5,4-28 

control decomposition 7-5 

controlling execution characteristics 2-13 

controlling process 2-24,2-28,4-4,4-21 

controlling with system calls 4-2 

creating and controlling 4-4 

debugging 2-34 

decomposition 7-3 

default partition 2-14 

designing 7-1 

designing a communication strategy 7-6 

domain decomposition 7-3 

error handling 4-42 

error messages 2-12 

executing 2-11 

-gth switch 8-18 

I/O redirection 2-12 

independent of number of nodes 7-5 

interactive 2-34 

killing application processes 4-23 

listing the applications in a partition 2-51 

-lnx compiler switch 2-11

load balancing 7-3 

managing running applications 2-23 

matrix*vector example 7-11 

-mbf switch 8-16 

-mea switch 8-17 

message buffers 3-14 

message passing with controlling process 4-25 

-mex switch 8-19 

-noc switch 8-17 

node numbers 3-3 

nqueens example 7-13 

-nx compiler switch 2-11 

-on switch 2-19 

order of switches 2-13 

overlapping 2-36 

partition of 2-14 

perfectly-parallel 7-2 
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performance improvement techniques 8-1 

pi example 7-7 

-pkt switch 8-16 

-plk switch 8-3,8-15 

-pn switch 2-22 

-pri switch 2-17 

priority of 2-17,2-35,4-9 

process type of 2-18,3-4 

-pt switch 2-18, 3-4 

rectangular dimensions of 4-16 

rectangular size 2-16 

removing partitions containing 2-45 

running in a particular partition 2-22 

running multiple programs 2-21 

running on a subset of the nodes 2-18 

-set switch 8-18 

separating the user interface from the 
computation 7-3 
shell scripts 2-24 
size of 2-15 
-sth switch 8-18 
-sz switch 2-16,3-3 
waiting for application processes 4-14 

arbitration between processes 2-33 

arbitration mechanisms 6-1 

archcube command B-5 

architecture of your workstation 2-6 

argc and argv parameters 4-5,4-13

arithmetic, extended 5-37 

arrays 

accessing contiguous memory locations 8-5 
dynamic allocation of 8-8 
large 8-8 

memory layout in C and Fortran 8-5 
arrow LEDs 1-4 

assembly language programming 8-1,8-7 
asynchronous and synchronous calls 8-10 
asynchronous and synchronous I/O calls 8-24 
asynchronous cancelability 6-29 


asynchronous file I/O calls 5-27 

asynchronous message-passing calls 3-6,3-7, 
3-10 

with interrupt handler 3-7 
attachcube command B-5
attachcube() system call B-10
attributes of file systems 5-5 
attributes of pthreads 6-15 
available nodes 2-48 
avoiding virtual memory paging 8-3 

B 

backward compatibility calls B-9 

backward library references 2-10 

bad nodes 2-31,2-48,4-40 

balancing the load among the nodes 7-3 

Basic Linear Algebra Subroutines (BLAS) 7-13 

Basic Math Library (libkmath.a) 8-6 

bg command 2-23 

binary files in PFS file systems 5-4 

BLAS library 8-6 

blocking 3-8,5-48 

on child processes 4-14 
with pthreads 6-37 

blocks, file system 8-25 

bold text vi 

Bourne shell (sh) 2-24 
with pthreads 6-39 

brackets, in syntax descriptions vii 

broadcast 3-9

buffering

of Fortran I/O 5-30 
of messages 3-14 
of standard I/O 2-12 

buffers for I/O 8-25 

buffers for messages 8-11 
aligning 8-12 

buffers, message 8-16 

bureaucracy in node programs 7-5 

bytes read or written 5-26 

byte-swapping calls B-14 

C 

C library, reentrant 6-2 

C preprocessor on a Fortran program 2-9 

C programs 

error handling 4-42,8-7 

file descriptors 5-8 

including nx.h 2-8 

memory access considerations 8-5 

pointers to message buffers 8-13 

structure padding 8-13 

cache lines 8-5,8-12 

caches 
data 8-5 
instruction 8-6 

caching 8-5 

canceling pthreads 6-28 
cat command 5-5,5-35 
cbackup command B-7 


cc command 2-4,2-5, A-1 
-host switch B-10 
-I switch 2-9,2-10 
-L switch 2-9,2-10 
-Mquad switch B-4 
-node switch B-4 
-nx switch 2-5, B-4 
order of switches 2-10 
-p switch B-4 

CFS 5-1, B-7 

/cfs directory 5-13 

CFS_MOUNT environment variable 5-13 
cfschk command B-7 

changing partition characteristics 2-54,4-36 

changing process type 3-4 

characteristics of a partition 2-29 
default 2-29 

characteristics of processes 3-3 

chdir() system call 5-31,8-9 
with pthreads 6-9 

chess example 7-13 

chgrp command 5-35 

child partitions 2-29,2-34 
creating 2-39 
listing 2-49 
removing 2-45 

removing partitions containing 2-45 
child processes 4-10,4-11,4-14 
chmod command 5-35 
chmod() system call 5-31,8-9 
chown command 5-35 
chown() system call 5-31 
chpart command 2-29,2-54, A-2 
CLASSPACK 8-6 
cleanup routines for pthreads 6-32 
clock, global 4-50 

close() system call 5-28 
synchronization 5-48 

closing parallel files 5-15,5-28 

code segment 8-15 

commands 2-1 

compiling and linking applications 2-5 
cross-development 2-5 
executing applications 2-11 
iPSC system compatibility B-5 
managing partitions 2-25 
managing running applications 2-23 
native 2-5 

on the Intel supercomputer 2-2 
on workstations 2-2 
PFS 5-5 
summary A-1 

commons in message passing 3-17 
communication 

overlapping with computation 8-10 

compatibility with the iPSC system B-1 

compiler switches 2-5,2-8,2-10, B-4 

compilers, iPSC system 2-7 
compatibility with B-4 

compiling and linking 
optimization 8-3 

compiling and linking applications 1-5,2-5 
-host switch B-10 
-lnx switch 2-6 
order of 2-10 
-Mquad switch B-4 
-node switch B-4 
-nx switch 2-5, B-4 
-p switch B-4 
quick example 2-6 

specifying include file pathnames 2-8 
specifying library pathnames 2-8 
tips 2-8 

with pthreads 6-5 


complete (synchronous) system calls 
file I/O 5-24 
message passing 3-7 

compress command 5-35 

computation 

overlapping with communication 8-10 

computational kernel of an application 7-3 

compute nodes 1-2 

compute partition 2-2,2-27,2-28 

Concurrent File System 5-1, B-7 

condition variables 6-21 

configuration parameters of the allocator 2-36 

configuring message passing 8-18 

configuring your environment 
for cross-development 2-6 
for online manual pages 2-6 

contiguous memory locations 8-5 

contiguous nodes 2-16,2-40,4-5,4-28 

contiguous partitions 2-31 

control decomposition 7-5 
example 7-14 

controlling application execution with system calls 4-2 

controlling process 2-24,2-28,2-51,4-4,4-21 
global operations 4-25 
message passing 4-25 
node number of 4-25 
process type of 4-25 

controlling tape devices 5-44 

controlling the application’s execution characteristics 2-13 

coprocessor 8-10 

copying processes onto nodes 4-10 
core dumps 4-44 
core files in PFS file systems 5-4 

correspondents parameter 8-17 

count parameter 3-5 

Courier font vi 

cp command 5-5,5-35 

cpio command 5-35 

cprobe() system call 3-14, A-7, A-31 

cprobex() system call 3-16,3-24, A-9, A-34 

cptape command B-7 

cread() system call 5-15,5-25, A-19, A-45 
synchronization 5-48 

creadv() system call 5-25,5-48, A-19, A-45 

creat() system call 5-31 

createstruc() system call B-14 

creating an application 4-4 

creating partitions 2-39,4-28 

crecv() system call 3-8, A-5, A-29 
message buffering 8-11 

crecvx() system call 3-16,3-24, A-9, A-33 

crestore command B-7 

critical code 3-22 

cross-compilers 2-2 

cross-development facility 1-6 
commands 2-5 

configuring your environment 2-6 

csend() system call 3-8, A-5, A-29 
message buffering 8-11 

csendrecv() system call 3-8, A-5, A-29 

cthreads 6-2 

CTOH...() system calls B-14 
<Ctrl-c> key 2-23 
<Ctrl-z> key 2-23 


cube control commands B-5 
cube.h file B-9 
cubeinfo command B-5 
cubeinfo() system call B-10 
current partition 2-28 

cwrite() system call 5-15,5-24,5-25,5-31, A-19, A-45 

synchronization 5-48 

cwritev() system call 5-25,5-48, A-19, A-46 

D 

data cache 8-5 
data locality 8-10 
data segment 8-15 

dclock() system call 4-50, A-16, A-43, B-16 
dead nodes 2-31,2-48,4-40 
deadlock 8-17 

dealing out data to the nodes 7-3 

debugging applications 2-34 

declaring large arrays 8-8 

decomposition 7-3 

control decomposition 7-5 
domain decomposition 7-3 

default application size 2-17 

default characteristics of a partition 2-29 

default partition 2-2,2-14 
determining 2-15 
listing applications in 2-51 
listing subpartitions of 2-49 
setting 2-14 

showing characteristics of 2-46 
<Del> key 2-23 
DELTA System 4-52 




designing a communication strategy 7-6 
designing a parallel application 7-1 
destroying partitions 2-45,4-30 
detecting end-of-file 5-29 
determining your default partition 2-15 
/dev/io0/rmt6 device 5-46 
devices, disk 5-2 
df command 5-7 
diff command 5-35 

differences between iPSC and Paragon B-1 

disk space allocated to a file 5-7,5-32 

disks and file systems 5-2 

displaying file system attributes 5-5 

distributed memory 7-2 

distributing computation among the nodes 7-3 

distributing data among the nodes 7-3 

documents, related vii 

domain decomposition 7-3 
example 7-8 

dot (.) in partition pathnames 2-28 
dot (.) partition (root partition) 2-26 
du command 5-35 
dumping core 4-44 
dynamic algorithm selection 3-27 
dynamic memory allocation 8-8 

E 

eadd() system call 5-37, A-22, A-49 
ecmp() system call 5-37,5-38, A-22, A-49 
ed command 5-35 


ediv() system call 5-37,5-38, A-22, A-49 

effective priority limit 2-35,2-44,2-56 

efficiency of PFS files 8-23 

ellipses (...), in syntax descriptions vii 

emod() system call 5-37,5-38, A-22, A-49 

emul() system call 5-37, A-22, A-49 

end-of-file 5-29 

environment variables 
CFS_MOUNT 5-13 
for compiling and linking 2-6 
for online manual pages 2-6 
IPSC_XDEV B-4 
MANPATH 2-6 

NX_DFLT_PART 2-3,2-12,2-14 
NX_DFLT_SIZE 2-12,2-17,2-22 
of child processes 4-13 
PARAGON_XDEV 2-6,2-9 
PATH 2-6 

envp parameter 4-13 

errno variable 4-42 
with pthreads 6-41 

"error 216 occurred, unknown" error 2-12 

error handling 4-42,8-7 

in parallel file I/O calls 5-26 
with pthreads 6-41 

error messages 2-12 

eseek() system call 5-36, A-21, A-48 
synchronization 5-48 

esize() system call 5-36, A-21, A-48 

esize_t structure 5-37 

estat structure 5-36 

estat() system call 5-36, A-21 

estat.h file 5-37 

estatfs structure 5-41 

esub() system call 5-37, A-22, A-49 
Ethernet interface 1-2 

etos() system call 5-37, A-22, A-49 

ex command 5-35 

example of compiling and linking 2-3 

examples 

iomodes 5-17 
matrix*vector 7-11 
nqueens 7-13 
pi 7-7 

pthreads 6-18 
triangle 7-18 

"exceeded allocator configuration parameters" error 2-36 

"exceeds partition resources" error 2-12,2-16 

exception mask 4-48 

exceptions 5-31 

exec() system call 3-5 
with pthreads 6-9 

executable files in PFS file systems 5-4 

execute (x) permission on a partition 2-33 

executing applications 2-3,2-11 
after cross-compilation 1-6 
controlling 2-13 

execution search path 2-6 

execution timing 4-50 

exit() system call, with pthreads 6-9,6-42 

extended arithmetic 5-37 

extended files 5-33 

extended receive and probe 3-24 

F 

.F extension 2-9 


f77 command 2-4,2-5, A-1 
-host switch B-10 
-I switch 2-9,2-10 
-L switch 2-9,2-10 
-Mquad switch B-4 
-node switch B-4 
-nx switch 2-5, B-4 
order of switches 2-10 
-p switch B-4 

failed nodes 2-31,2-48,4-40 
fault LEDs 1-3 
fcntl() system call 5-4,5-34 
fcube.h file B-9 
festat() system call 5-36, A-21 
FFT library 8-6 
fg command 2-23 
fgetpos() system call 5-34 
FIFO size 8-12 

file I/O, parallel (see also "parallel file I/O") 5-1 

file pointers 5-14 

file system blocks 8-25 

file systems 5-2 
attributes of 5-5 

getting information about PFS file systems 5-39 
fileID parameter 5-8 
filenames, length of 5-3 
files 

extended 5-33 
file descriptors 5-8 
file pointers 5-14 
moving 5-29 

maximum open at once 5-4 
size of 5-7,5-32 

find command 5-35 

fixed-size records 5-16 
flick() system call 4-50, A-16, A-43 

floating-point control calls 4-46 

flock() system call 5-4 

flockfile() system call 6-10 

flow control of messages 8-13 

flushing Fortran buffered I/O 5-30 

flushmsg() system call 4-52, A-17, A-43, B-16 

fnx.h file 2-8 

force types 3-6, B-2 

forceflush() system call 5-30, A-47 

forflush() system call 5-30, A-47 

fork() system call 2-6,3-5,4-10,5-13 
with pthreads 6-9 

forking processes onto nodes 4-10 

form="formatted" parameter 5-11 

form="unformatted" parameter 5-11 

formatted files 5-11 

fort.nnn files 5-13 

Fortran programs 
error handling 4-42 
file I/O on parallel files 5-24 
flushing buffered I/O 5-30 
including fnx.h 2-8 
memory access considerations 8-5 
message passing with Fortran commons 3-17 
opening new files 5-12 
opening parallel files 5-11 
parallel file I/O calls 5-25 
preprocessing 2-9 
sequential files 5-26 
unformatted files 5-26 
units 5-8 

fpgetmask() system call 4-46, A-16 
fpgetround() system call 4-46, A-16 
fpgetsticky() system call 4-46, A-16 

fpsetmask() system call 4-46,5-31, A-16, A-42 

fpsetround() system call 4-46, A-16 

fpsetsticky() system call 4-46, A-16 

fread() system call 8-24 

free nodes 2-48 

front panel LEDs 1-3 

fseek() system call 5-34 

fsetpos() system call 5-34 

fsplit command A-3 

fstat() system call 5-34 

fstatfs() system call 5-41,8-25 

fstatpfs() system call 5-39,8-26, A-23 

ftell() system call 5-34 

FTNxxxxxxxx.nn files 5-13 

ftp command 1-6,2-7,5-35 

ftruncate() system call 5-34 

full stripe size 8-25 

funlockfile() system call 6-10 

G 

gang scheduling 2-35,2-43,2-55 
with pthreads 6-4 

Gauss-Seidel method 8-11 

gcol() system call 3-27, A-10, A-35 

gcolx() system call 3-27,7-6,7-13, A-10, A-35 

gdhigh() system call 3-27, A-10, A-35 

gdlow() system call 3-27, A-10, A-35 

gdprod() system call 3-27, A-10, A-35 

gdsum() system call 3-27,3-28,7-6, A-10, A-35 

general cancelability 6-28 
getcube command B-5 

getcube() system call B-10 

getiphosts() system call B-16 

getmntinfo() system call 5-39 

getpfsinfo() system call 5-39, A-23 

getrusage() system call 8-9 

getting information about PFS file systems 5-39 

GFLOPS 8-2 

giand() system call 3-27, A-10, A-35 
gigabyte files 5-33 

gihigh() system call 3-27, A-10, A-36 
gilow() system call 3-27, A-11, A-36 
ginv() system call 4-52, A-17, A-43 
gior() system call 3-27, A-11, A-36 
giprod() system call 3-27, A-11, A-36 
gisum() system call 3-27, A-11, A-36 
give_threshold parameter 8-18 
gixor() system call B-16 
gland() system call 3-27, A-11, A-36 
global clock 4-50 

global operations 3-4,3-27,5-9,5-13,5-14,5-48,7-6 

and controlling process 4-25 
effect of -noc switch 8-22 
with -on switch 2-20 
with pthreads 6-4, 6-37 

global predicate variables 6-21 

glor() system call 3-27, A-11, A-36 

glxor() system call B-17 

gopen() system call 5-9,8-23, A-18, A-45 
synchronization 5-48 
with pthreads 6-4 


gopf() system call 3-27, A-11, A-37 

gprof command 8-2 

gray() system call 4-52, A-17, A-43 

green LEDs 1-3 

group of a partition 2-32,2-56 

groups of processes 4-22 

gsendx() system call 3-9, A-5, A-29 

gshigh() system call 3-27, A-11, A-37 

gslow() system call 3-27, A-11, A-37 

gsprod() system call 3-27, A-12, A-37 

gssum() system call 3-27, A-12, A-37 

gsync() system call 3-27, A-12, A-37 
with open() B-16 
with pthreads 6-4 

-gth switch 8-18 

H 

handled message-passing calls 3-7 

handled types 3-19 

handler() system call B-17 

handling errors 4-42,8-7 
with pthreads 6-41 

hard disk 5-2 

hardware 1-2 

hardware failures 2-31 

heap 8-15 

"hello, world" program 2-4 
hierarchical partition structure 2-28 
host calls B-9 
-host switch B-10 
hparam parameter 3-21 
hrecv() system call 3-18, A-8, A-32 
with pthreads 6-38 

hrecvx() system call 3-24, A-9, A-34 

hsend() system call 3-18,3-20, A-8, A-32 
with pthreads 6-38 

hsendrecv() system call 3-18,3-20, A-8, A-32 
hsendx() system call 3-20, A-8, A-33 
HTOC...() system calls B-14 
hwclock() system call 4-52, A-17, A-43 


I 

I/O buffers 8-25 

I/O calls, efficiency of 8-24 

I/O IDs 5-27 

I/O interfaces 1-2 

I/O modes 5-9,5-13 
efficiency of 8-24 
example 5-17 

inheritance across fork() 5-13 
M_GLOBAL 5-17 
M_LOG 5-15 
M_RECORD 5-16,8-26 
M_SYNC 5-15 
M_UNIX 5-14 
standard I/O 2-12 
synchronization of 5-48 

I/O nodes 5-2 

I/O partition 2-26 

I/O redirection 2-12 

I/O request size 8-23,8-25 

I/O to parallel files 5-24 

I/O, parallel (see also "parallel file I/O") 5-1 


i860 microprocessor 1-2,8-5 
cache line 8-12 
data cache 8-5 
FIFO size 8-12 

floating-point control registers 4-47 
instruction cache 8-6 
physical memory page 8-12 

icc command 2-5, A-1 

environment variables 2-6 
-I switch 2-9,2-10 
-Knoieee switch 8-4 
-L switch 2-9,2-10 
-Mnostride0 switch 8-4 
-Mquad switch B-4 
-Mvect switch 8-4 
-node switch B-4 
-nx switch 2-5, B-4 
-O switch 8-4 
order of switches 2-10 
-p switch B-4 

ID of a message 3-6 

IEEE math library 8-4 

IEEE NaN 4-47 

if/else blocks, efficiency of 8-6 

if77 command 2-5, A-1 

environment variables 2-6 
-I switch 2-9,2-10 
-Knoieee switch 8-4 
-L switch 2-9, 2-10 
-lkmath switch 7-13 
-Mnostride0 switch 8-4 
-Mquad switch B-4 
-Mvect switch 8-4 
-node switch B-4 
-nx switch 2-5, B-4 
-O switch 8-4 
order of switches 2-10 
-p switch B-4 

image enhancement 7-3 

improving performance 8-1 
inactive applications 2-35 

include directories 2-8 

include files 
cube.h B-9 
estat.h 5-37 
fcube.h B-9 
fnx.h 2-8 
mtio.h 5-44 
nx.h 2-8 

incomplete (asynchronous) system calls 
file I/O 5-24 
message passing 3-7 

increasing problem size 8-5 

increasing the size of a file 5-7,5-32 

info parameter 3-25 

info...() system calls, with pthreads 6-38 

infocount() system call 3-15, A-7, A-31 

infonode() system call 3-15, A-7, A-31 

infopid() system call 4-52, A-17, A-43 

infoptype() system call 3-15, A-7, A-31 

information about messages 3-15 

infotype() system call 3-15, A-7, A-31 

innermost loops 8-4 

instruction cache 8-6 

instruction pointer 6-1 

Intel supercomputer 
hardware 1-2 
software 1-4 
using commands on 2-2 

interactive applications 2-34 

interconnect network 1-2 

interfaces 1-2 

interrupt key 2-23 


interrupts 

preventing 3-22 

treating messages as interrupts 3-18 
INVALID_PTYPE constant 3-5 
ioctl() system call 5-44 

iodone() system call 5-24,5-27,5-28, A-19, A-46 

iomode() system call 5-13, A-18, A-45 

iowait() system call 5-24,5-27,5-28, A-19, A-46 

iprobe() system call 3-14, A-7, A-31 

iprobex() system call 3-16,3-24, A-9, A-34 

iPSC system 

CFS compatibility 5-1 
commands B-5 
compatibility calls 4-52 
compatibility with B-1 
compilers 2-7, B-4 

IPSC_XDEV environment variable B-4 
system calls B-9 

iread() system call 5-27, A-19, A-46, B-15 
synchronization 5-48 

ireadv() system call 5-27,5-48, A-19, A-46 

irecv() system call 3-10, A-6, A-30 

irecvx() system call 3-24, A-9, A-33 

isend() system call 3-10, A-6, A-30 

isendrecv() system call 3-10, A-6, A-30 

iseof() system call 5-15,5-24,5-26,5-29, A-20, A-47 

synchronization 5-48 
isnan() system call 4-46, A-16 
isnand() system call 4-46, A-16 
isnanf() system call 4-46, A-16 
italic text vi 

iwrite() system call 5-27,5-31, A-19, A-46, B-15 
synchronization 5-48 
iwritev() system call 5-27,5-48, A-19, A-46 

K 

kernel of an application 7-3 
kernel threads 6-3,6-16 
kill command 1-4,2-24,2-51 
kill() system call 4-15,4-23 
killcube command B-6 

killcube() system call 4-52, A-17, A-43, B-10 
killing application processes 4-23 
killproc() system call 4-52, A-17, A-44, B-10 
killsyslog() system call B-11 
-Knoieee switch 8-4 

L 

led() system call 4-52, A-16, A-43 
LEDs 1-3 

length of a filename or pathname 5-3 

length of a message 3-5,3-15 

less command B-8 

lestat() system call 5-36, A-21 

libc_r.a 6-2,6-6 

error handling 6-42 

lib-coff directory 2-8 

libkmath.a 8-6 

libmach.a 6-5 

libnx.a 2-5 

with pthreads 6-3 

libpthreads.a 6-2,6-11 
error handling 6-42 


libraries 

Basic Math Library 8-6 
BLAS 7-13 

command-line switches 2-10 
IEEE math library 8-4 
libkmath.a 7-13 
libnx.a 2-5 

pthreads package 6-5 
search path for 2-10 
Signal Processing Library 8-6 
specifying 2-8 

libsignal.a 8-6 

life of a process type 3-5 

limitations of PFS 5-4 

limitations of pthreads 6-3 

link switches 2-10 

link() system call 5-31 

linking an application 2-3 
single-pass linker 2-10 
specifying library pathnames 2-8 
with pthreads 6-5 

listing partitions 2-49 

listing the applications in a partition 2-51 

-lkmath switch 7-13 

-lnx switch 2-6 

effect on execution 2-11 
order of 2-10 
with pthreads 6-5 

load balancing 2-33,7-3 

load command B-6 

load() system call 4-52, A-17, A-44 

loading processes onto nodes 4-11 

locality of data 8-10 

locking a process in memory 8-15 

locking and unlocking pthreads 6-16 
locking data into memory 8-15 
logical node numbers 2-30 
loops, innermost 8-4 
loops, size of 8-6 
ls command 5-5,5-35 

lseek() system call 5-15,5-17,5-26,5-29,5-34, A-20, A-47 
synchronization 5-48 

lsize command 5-7, A-3 

lsize() system call 5-32, A-20, A-47 

lspart command 2-31,2-49, A-2 
-r switch 2-50 

lstat() system call 5-34 

M 

M_GLOBAL I/O mode 5-17, 8-24 

M_LOG I/O mode 5-15 

M_RECORD I/O mode 5-16, 8-24, 8-26 

M_SYNC I/O mode 5-15, 8-24 

M_UNIX I/O mode 5-14,8-24 

Mach kernel interface 6-5 

Mach threads 6-2 

madvise() system call 5-34 

magnetic tapes, controlling 5-44 

main thread 6-12 

maintaining data locality 8-10 

makewhatis command B-7 

making partitions 2-39,4-28 

making the program independent of the number of nodes 7-5 

malloc() system call 8-8,8-13,8-15 


manager-worker decomposition 7-5,7-14 

managing partitions 2-25 
with system calls 4-27 

managing running applications 2-23 

manpath command B-8 

MANPATH environment variable 2-6 

manual pages, configuring your environment for 2-6 

masktrap() system call 3-22, A-8, A-33 

math library, IEEE 8-4 

matrix*vector example 7-11 

maximum capacity of a PFS file system 5-3 

maximum length of a filename or pathname 5-3 

maximum number of open files 5-4 

maximum size of a PFS file 5-3 

-mbf switch 8-16 

mclock() system call 4-52, A-17, A-44 

-mea switch 8-17 

memory 

accessing contiguous memory locations 8-5 
allocated to message buffers 8-16 
distributed 7-2 
dynamic allocation 8-8 
locking 8-15 

locking data into memory 8-15 
of nodes 1-2 
physical 1-5 

physical pages 8-5,8-12 
static allocation 8-8 
virtual 1-5,8-3,8-15 

memory_each parameter 8-17 

memory_export ipc_option 8-19 

merging message IDs 3-13 

message buffers 8-16 

message coprocessor 8-10 
message handlers 3-20 

message_buffer parameter 8-16 

message-passing configuration switches 8-18 

message-passing flow control 8-13 

message-passing system calls 3-1 
with pthreads 6-4,6-37 

messages 1-1,7-2 
as interrupts 3-18 
asynchronous calls 3-10 
buffers 3-14,8-11 
aligning 8-12 

configuration options 8-18 
designing a communication strategy 7-6 
exchanging with controlling process 4-25 
force types 3-6 

getting information about 3-15 
handled types 3-19 

memory allocated to message buffers 8-16 

merging message IDs 3-13 

message characteristics 3-5 

message ID 3-6 

message IDs 3-10 

message length 3-5,3-15 

message order 3-7 

message passing with Fortran commons 3-17 
message type 3-6,3-15 
names of message-passing calls 3-7 
pending messages 3-14,8-11 
performance improvement techniques 8-7,8-21 

pthreads 6-37 

releasing message IDs 3-12 
synchronous calls 3-8 
typesel masks 3-6 
zero-length messages 3-6 

-mex switch 8-19 

MFLOPS 8-2 

miscellaneous system calls 4-50 
mkcfs command B-7 
mkdev command B-8 


mkdir() system call 5-31 

mknod() system call 5-31 

mkpart command 2-29,2-39, A-2 
-epl switch 2-44 
-mod switch 2-42 
-nd switch 2-40 
-rq switch 2-43 
-sps switch 2-43 
-ss switch 2-43 
-sz switch 2-40 

mmap() system call 5-4,5-34 

-Mnostride0 switch 8-4 

modes for I/O 5-13 

synchronization of 5-48 

modes of a partition 2-32,2-42,2-56 

monospace text vi 

more command 5-35 

mount points 5-2 

moving the file pointer 5-29 

mp_switches 8-18 

mprotect() system call 5-34 

-Mquad switch B-4 

msgcancel() system call 4-52, A-17, A-44 
msgdone() system call 3-10,3-16, A-6, A-30 
msgignore() system call 3-10, A-6, A-31 
msginfo array 3-16,3-25 
msgmerge() system call 3-13, A-6, A-31 
msgwait() system call 3-10,3-16, A-6, A-30 
msync() system call 5-34 
MT operations 5-44 
mtio.h file 5-44 

multi-node performance 8-7,8-21 
multiple nodes 3-9 
multiple programs in an application 2-21 
munmap() system call 5-34 
mutexes 6-1,6-16 
mv command 5-5,5-35 
-Mvect switch 8-4 

myapp (any application) command 2-11 

myapp.c 2-4 

myapp.f 2-4 

myhost() system call 4-25, A-4, A-28 
mynode() system call 2-17,3-3,8-2, A-4, A-28 
mypart() system call 2-17,4-17,4-52, A-17, A-44 
mypid() system call 4-52, A-17, A-44 
myptype() system call 2-18,3-4,8-2, A-4, A-28 

N 

name of a partition 2-30,2-56 

named commons in message passing 3-17 

names of message-passing calls 3-7 

NaN (Not-a-Number) 4-47 

native commands 2-5 

new features in Paragon OSF/1 B-2 

new files 5-12 

newfs command 5-35 

newserver command B-6 

newserver() system call B-11 

NEXTPATH() macro 5-40 

NFS (Network File System) 1-6,2-7,5-2 
accessing PFS files 5-4 
parallel I/O to 5-24 

-noc switch 8-17 

node interconnect network 1-2 


node numbers 3-3 
in filenames 5-31 
in overlapping partitions 2-32 
logical 2-30 

of a received or pending message 3-15 
of controlling process 4-25 
physical 2-30 
within a partition 2-30 

node parameter 3-3 

_NODE preprocessor symbol 2-5 

node programs 

error handling 8-7 

-node switch B-4 

nodedim() system call 4-52, A-18, A-44 

nodes 1-1,1-2,7-2 

allocated to a partition 2-30,2-40 
allocated to an application 2-15,4-4 
compute nodes 1-2 
contiguous 2-16,2-40,4-5,4-28 
copying processes onto nodes 4-10 
designing a communication strategy 7-6 
free 2-48 
I/O nodes 5-2 
load balancing 7-3 
loading processes onto nodes 4-11 
making programs independent of number of nodes 7-5 
node numbers 3-3 
node topologies 7-6 
operating system 1-4 
partitions 2-25 

running application processes on a subset 2-18 

service nodes 1-2 

unusable nodes 2-31,2-48,4-40 

nodesel parameter 3-3 

nodespecs 2-19,2-40 

noieee switch 8-4 

noncontiguous nodes 2-16,2-40,4-5,4-28 
noncontiguous partitions 2-31 
non-parallel programs 2-11,2-27 

nostride0 switch 8-4 

Not-a-Number (NaN) 4-47 

notational conventions used in the manual vi 

nqueens example 7-13 

nsh command B-9 

number of bytes read or written 5-26 

numbers, extended 5-37 

numerical methods 8-1 

numnodes() system call 2-17,3-3, A-4, A-28 

-nx switch 2-5, B-4 

actions performed by 4-3 
and nx_initve() 4-4 
command-line switches 2-13,8-18 
effect on execution 2-11 
with pthreads 6-5 

nx.h file 2-8 

nx_...() system calls, error handling of 4-42 

nx_app_nodes() system call 4-16, A-13, A-39 

nx_app_rect() system call 2-17,4-16,8-10, A-13, A-39 

nx_chpart_epl() system call 4-36, A-14, A-41 

nx_chpart_mod() system call 4-36, A-14, A-41 

nx_chpart_name() system call 4-36, A-14, A-41 

nx_chpart_owner() system call 4-36, A-15, A-41 

nx_chpart_rq() system call 4-36, A-14, A-41 

nx_chpart_sched() system call 4-36, A-15, A-41 

NX_DFLT_PART environment variable 2-3,2-12,2-14 

NX_DFLT_SIZE environment variable 2-12,2-17,2-22 

nx_empty_nodes() system call 4-40, A-15, A-42 
nx_failed_nodes() system call 4-40, A-15, A-42 


nx_initve() system call 4-3,4-4,4-21, A-12, A-38 
linking 2-6 
with pthreads 6-39 

nx_initve_rect() system call 4-3,4-7, A-12, A-38 

nx_load() system call 2-20,3-5,4-3,4-11,4-14, A-13, A-39 

nx_loadve() system call 2-20,3-5,4-3,4-13,4-14, A-13, A-39 

nx_mkpart() system call 4-28, A-14, A-40 

nx_mkpart_map() system call 4-28, A-14, A-40 

nx_mkpart_rect() system call 4-28, A-14, A-40 

nx_nfork() system call 2-20,3-5,4-3,4-10,4-14, A-12, A-38 
with pthreads 6-39 

nx_part_attr() system call 4-31, A-14, A-40 

nx_part_nodes() system call 4-31, A-14, A-41 

nx_perror() system call 4-42, A-15, A-42 
with pthreads 6-42 

nx_pri() system call 4-3,4-9, A-12, A-38 
nx_pspart() system call 4-16, A-13 
nx_rmpart() system call 4-30, A-14, A-40 
nx_waitall() system call 4-3,4-14,4-45, A-13, A-39 


O 

-on switch 2-19 

open() system call 5-10,5-11,5-31,8-23, B-15 

opening parallel files 5-9 
### in filenames 5-31 
special considerations for Fortran 5-11 
with standard operations 5-11 

operating system 1-4 

"Operation not supported by this file system" error 5-4 
compiler 8-3 

optimizations 8-3 

order of application switches 2-13 

order of compiler switches 2-10 

order of messages 3-7 

organization of the manual v 

OSF/1 operating system 1-4 
commands 2-2 

OSF/1 PIDs 4-15 

other system calls 4-1 

overlapping computation and communication 8-10 
overlapping partitions or applications 2-36 
owner of a partition 2-32,2-56 

P 

-p switch B-4 

packet_size parameter 8-16 

packetization 8-16 

padding in common blocks 3-17 

pages of physical memory 8-5 

paging 8-3 

preventing 8-15 

Paragon OSF/1 operating system 1-4 
commands 2-1 

message-passing system calls 3-1 
new features B-2 
other system calls 4-1 
parallel file I/O 5-1 
programming model 7-2 

Paragon system 
hardware 1-2 
software 1-4 

PARAGON_XDEV environment variable 2-6,2-9 


$PARAGON_XDEV/paragon directory 2-8 

ParaGraph performance visualization tool 8-10 

parallel applications 1-1,2-1 

parallel file I/O 5-1 

### in filenames 5-31 
asynchronous I/O calls 5-27 
efficiency of 8-24 
closing files 5-28 
detecting end-of-file 5-29 
efficiency of 8-24 
error handling 5-26 
file pointers 5-14 

flushing Fortran buffered I/O 5-30 
formatted vs. unformatted I/O 5-11 
I/O modes 5-9,5-13 
efficiency of 8-24 
I/O performance 8-23 
in Fortran programs 5-25 
increasing the size of a file 5-7,5-32 
manipulating extended files 5-36 
moving the file pointer 5-29 
new files 5-12 
opening files 5-9 

with standard operations 5-11 
reading and writing files 5-24 
scattered read and write 5-25 
special considerations for Fortran 5-11 
synchronizing calls 5-9,5-13,5-14 
synchronizing operations 5-48 
synchronous I/O calls 5-25 
system calls 5-8 
tapes, controlling 5-44 
to NFS files 5-24 
to the user’s terminal 5-31 
unnamed files 5-13 
with pthreads 6-4,6-38 

parallel file system (see also "PFS") 5-1 

parallel programming techniques 7-2 

parent partition 2-29 

"partition permission denied" error 2-12 
partitions 1-6,2-2,2-25 

allocating nodes to applications 2-15 
changing partition characteristics 2-54,4-36 
characteristics 2-29 
child partitions 2-29,2-34 
compute partition 2-2,2-27,2-28 
contiguous and noncontiguous 2-31 
contiguous nodes 2-40 
current 2-28 

default characteristics 2-29 
default partition 2-2,2-14 
determining 2-15 
setting 2-14 

dot (.) partition (root partition) 2-26 
effective priority limit 2-35,2-44,2-56 
error messages 2-14 
execute (x) permission 2-33 
free nodes 2-48 

gang-scheduled 2-35,2-43,2-55 
hierarchical structure 2-28 
I/O partition 2-26 
listing 2-49 

listing the applications in a partition 2-51 
making partitions 2-39,4-28 
managing 2-25 

managing with system calls 4-27 
name of a partition 2-30,2-56 
nodes 2-30 

nodes allocated to a partition 2-40 

overlapping 2-36 

owner and group 2-32,2-56 

parent partition 2-29 

pathnames 2-28 

permission bits 2-32,2-42,2-56 

priority 2-35 

protection modes 2-32,2-42,2-56 
read (r) permission 2-33 
removing partitions 2-45,4-30 
rollin quantum 2-34,2-35,2-43,2-55 
root partition 2-26,2-29 
shape of 2-26 

running applications in 2-22 
scheduling characteristics 2-33,2-43 
service partition 2-2,2-24,2-27 
showing partition characteristics 2-46 


space-shared 2-37,2-43,2-56 
special 2-26 

standard-scheduled 2-34,2-43 
subpartitions 2-29,2-34 
unusable nodes 2-31,2-48,4-40 
write (w) permission 2-33 

passing information to the handler 3-20 

PATH environment variable 2-6 

pathnames of partitions 2-28 

pathnames, length of 5-3 

pending messages 3-14,8-11 
getting information about 3-15 

perfectly-parallel applications 1-6,7-2 

performance improvement techniques 8-1 

performance of PFS files 8-23 

performance visualization 8-10 

performing extended arithmetic 5-37 

permissions of a partition 2-32,2-42,2-56 

per-node vector size 8-5 

perror() system call, with pthreads 6-9,6-42 




PFS 5-3 

accessing via NFS 5-4 
commands 5-5 
core files in 5-4 
executable files in 5-4 
file systems 5-3 

filename and pathname length 5-3 
files 5-3 

getting information about PFS file systems 5-39 

limitations of 5-4 

manipulating extended files 5-36 

maximum capacity 5-3 

maximum file size 5-3 

maximum number of open files 5-4 

mount points 5-2 

opening PFS files 5-9 

with standard operations 5-11 
performance 8-23 
special files in 5-3 
stripe directories 5-3 
stripe units 5-3 

/pfs file system 5-5 

PFS file systems 
block size 8-25 
striping 8-25 

pfsmntinfo structure 5-39 

physical memory 1-5 

physical memory page 8-12 

physical memory pages 8-5 

physical nodes 2-30 

physical topology 7-6 

pi example 7-7 

PIDs (process IDs) 4-14 

contrasted with process types 4-15 

-pkt switch 8-16 

-plk switch 8-3, 8-15 

plock() system call 8-15 

plogon and plogoff commands B-8 


plogon() and plogoff() system calls B-17 
pmake command A-3 
-pn switch 2-22 

pointers to message buffers 8-13 
porting iPSC programs B-1 
porting serial codes 7-5 
POSIX threads 6-1 
preallocating disk space 5-33 
preprocessing a Fortran program 2-9 

preprocessor symbol _NODE 2-5 

preventing interrupts 3-22 
-pri switch 2-17 
priority 

effective priority limit of a partition 2-35,2-44,2-56 

of a partition 2-35 
of a process 2-35 
of an application 2-17,2-35,4-9 

probing for pending messages 3-14 
extended 3-24 

problem decomposition 7-3 

problem size 8-5 

process group IDs 2-51 

process group leaders 4-4,4-22 

process IDs (PIDs) 4-14 

process locking 8-15 

process types 3-4 
changing 3-4 

contrasted with OSF/1 PIDs 4-15 
INVALID_PTYPE 3-5 
life of 3-5 

of a received or pending message 3-15 
of an application 2-18 
of controlling process 4-25 
with pthreads 3-5,6-37 




processes 

arbitration between 2-33 

characteristics 3-3 

child processes 4-10,4-11,4-14 

controlling process 2-24,2-28,2-51,4-4,4-21 

copying processes onto nodes 4-10 

loading processes onto nodes 4-11 

PIDs (process IDs) 4-14 

priority of 2-35 

process groups 4-22 

process types 3-4 

threads 1-1,6-1 

waiting for application processes 4-14 

processor time 2-17 

processors 1-1 

prof command 8-2 

profiling B-4 

profiling tools 8-2 

program development tools 1-6 

programming model 1-6,7-2 

programming techniques 7-2 

programs, non-parallel 2-1,2-27 

protection modes of a partition 2-32,2-42,2-56 

ps command 2-24 

pspart command 2-51, A-2 
-r switch 2-54 

-pt switch 2-18,3-4 

pthread_attr_create() system call 6-15, A-24 
pthread_attr_delete() system call 6-15, A-24 
pthread_attr_getstacksize() system call 6-15, A-24 
pthread_attr_setstacksize() system call 6-15, A-24 
pthread_cancel() system call 6-28, A-26 
pthread_cleanup_pop() system call 6-32, A-27 
pthread_cleanup_push() system call 6-32, A-27 


pthread_cond_broadcast() system call 6-21, A-26 
pthread_cond_destroy() system call 6-21, A-26 
pthread_cond_init() system call 6-21, A-26 
pthread_cond_signal() system call 6-21, A-26 
pthread_cond_timedwait() system call 6-21, A-26 
pthread_cond_wait() system call 6-21, A-26 
pthread_condattr_create() system call 6-23, A-26 
pthread_condattr_delete() system call 6-23, A-26 
pthread_create() system call 6-13, A-24 
pthread_detach() system call 6-13, A-24 
pthread_equal() system call 6-13, A-24 
pthread_exit() system call 6-13, A-24 
pthread_getspecific() system call 6-33, A-27 
pthread_join() system call 6-13, A-24 
pthread_keycreate() system call 6-33, A-27 
pthread_mutex_destroy() system call 6-16, A-25 
pthread_mutex_init() system call 6-16, A-25 
pthread_mutex_lock() system call 6-16, A-25 
pthread_mutex_trylock() system call 6-16, A-25 
pthread_mutex_unlock() system call 6-16, A-25 
pthread_mutexattr_create() system call 6-17, A-25 
pthread_mutexattr_delete() system call 6-17, A-25 
pthread_once() system call 6-34, A-27 
pthread_self() system call 6-13, A-24 
pthread_setasynccancel() system call 6-28, A-26 
pthread_setcancel() system call 6-28, A-26 
pthread_setspecific() system call 6-33, A-27 
pthread_testcancel() system call 6-28, A-26 
pthread_yield() system call 6-13, A-24 




pthreads 

attributes of 6-15 
Bourne shell 6-39 
canceling 6-28 
cleanup routines 6-32 
compiling and linking 6-5 
condition variables 6-21 
data types and symbols 6-11 
error handling 6-41 
example program 6-18 
execution of 6-13 
file I/O 6-38 

global predicate variables 6-21 

kernel threads 6-3 

keys 6-33 

libraries 6-2 

limitations of 6-3 

locking and unlocking 6-16 

main thread 6-12 

message-passing calls 6-37 

mutexes 6-16 

non-thread-safe code 6-37 

priority of 6-15 

process types 3-5 

pthreads library calls 6-11 

pthreads package 6-1 

pthread-specific data objects 6-33 

recommended safe operating environment 6-4 

reentrant C library 6-6 

signals 6-34,6-39 

stack size of 6-15 

synchronization of 6-21 

ptype (see also "process type") 2-18 

ptype parameter 3-4 

ptypesel parameter 3-4 

Q 

quad-alignment B-4 
queens example 7-13 
quick example 2-3 


quotaon command 5-4 

R 

_r calls 6-10 

RAID (Redundant Array of Inexpensive Disks) 5-2 

rar command B-8 

ras command B-8 

rcc command B-8 

rcmd command 1-6 

rcp command 1-6,2-7,5-35 

read (r) permission on a partition 2-33 

read statement 5-24 

read() system call 5-25 
with pthreads 6-4 

reading files in parallel 5-24 

readlink() system call 5-31 

readv() system call 5-25 

rebootcube command B-9 

receiving messages 3-8 
extended 3-24 

record size 5-16 

rectangular applications 2-16 

rectangular dimensions of an application 4-16 

recursively listing applications in subpartitions 2-54 

recursively listing subpartitions 2-50 

recursively removing subpartitions 2-45 

red LEDs 1-3 

redirecting I/O 2-12 

reentrant C library 6-2,6-6 

_REENTRANT preprocessor symbol 6-5,6-41 

reentrant software 6-1 




related documents vii 

relative partition pathname 2-28 

relcube command B-6 

relcube() system call B-11 

releasing control of the processor 4-50 

releasing I/O IDs 5-28 

releasing message IDs 3-12 

relstruc() system call B-14 

remote host commands B-8 

removing partitions 2-45,4-30 

rename() system call 5-31 

repeated use of system calls 8-2 

"request overlaps with nodes in use" error 2-12, 2-16, 2-34 

request size 8-23,8-25 
resources 6-1 

restrictvol() system call 4-52, A-18, A-44 

rewinding a tape 5-46 

rf77 command B-8 

ring topology 7-6 

rld command B-8 

rlogin command 1-5 

rm command 5-35 

rmdir() system call 5-31 

rmpart command 2-45, A-2 
-f switch 2-45 
-r switch 2-45 

rollin and rollout 2-2,4-50 

rollin quantum 2-34,2-35 
of a partition 2-43,2-55 

root account 2-26,2-33,2-56 


root partition 2-26,2-29 
shape of 2-26 

rounding mode 4-47 

RPM global clock 4-50 

rsh command 1-6 

running a program on a subset of the nodes 2-18 

running applications 2-11 

consisting of multiple programs 2-21 
in a particular partition 2-22 
removing partitions containing 2-45 

S 

sat command A-3 

scattered read and write 5-25 

scheduling characteristics of a partition 2-33,2-43 

scheduling mechanisms 1-1 

"scheduling parameters conflict with allocator configuration" error 2-36 

scratch files 5-13 

SCSI interface 1-2,5-2 

-sct switch 8-18 

sdot() BLAS function 7-13 

search path 2-6 

search path for libraries 2-10 

seeking on a file 5-29 

send_avail value 8-17 

send_count parameter 8-17 

send_threshold parameter 8-17 

sending messages 3-8 

sending to multiple nodes 3-9 

separating the user interface from the computation 7-3 




sequential files 5-26 
serial codes, porting 7-5 
serializing calls 8-9 
service nodes 1-2 

service partition 2-2,2-11,2-24,2-27 

setiomode() system call 5-13, A-18, A-45 
synchronization 5-48 
with pthreads 6-4 

setiphost() system call B-17 

setpart alias 2-15 

setpgrp() system call B-17 

setpid() system call B-11 

setptype() system call 3-4, A-4, A-28 
in controlling process 4-25 

setsyslog() system call B-11 

setting your default partition 2-14 

sh command 2-24 
with pthreads 6-39 

shadow buffers 8-11 

shape of an application 4-16 

shape of the root partition 2-26 

shell 2-27 

shell scripts 2-24 

shepherd process 2-24 

showfs command 5-5,8-26, A-3 

showing partition characteristics 2-46 

showpart command 2-29,2-31,2-46,2-50, A-2 
-f switch 2-48 

showvol command B-7 

Signal Processing Library (libsignal.a) 8-6 

signal() system call 4-44 

signals, with pthreads 6-5,6-34,6-39 


sigwait() system call 6-5,6-34, A-27 

single program multiple data (SPMD) programming model 1-6, 7-2 

single system image 1-4,7-2 

single-node performance 8-2 

64-bit integers 5-37 

size 

of a file 5-7,5-32 
of a message 3-5 
of a packet 8-16 
of a partition 2-40 
of an application 2-15 
rectangular 2-16 

size of a problem 8-5 

sizeof operator 3-6 

sleep system call, with pthreads 6-9 

software failures 2-31 

space sharing 2-37,2-43,2-56 

special partitions 2-26 

specifying application priority 2-17 

specifying application size 2-15 

specifying nodes allocated to a partition 2-40 

specifying process type 2-18 

speed of calculation 8-2 

speedup of a parallel program 7-2 

SRM B-1 

stack 6-1,8-15 

stack size of pthreads 6-15 

standard include directory 2-9 

standard input and output, redirecting 2-12 

standard scheduling 2-34,2-43 

star command B-7 




startcube command B-6 
start-up routine 2-5 
stat() system call 5-31,5-34,8-9 
statfs() system call 5-31,5-41,8-25 
static memory allocation 8-8 
statpfs structure 5-39 
statpfs() system call 5-39,8-26, A-23 
status of a tape device 5-46 
status="new" parameter 5-12 
status="scratch" parameter 5-13 
-sth switch 8-18 
sticky flags 4-48 

stoe() system call 5-37,5-38, A-22, A-49 

stream command B-7 

stripe directories 5-3,8-25 

stripe example 8-26 

stripe factor 8-25 

stripe unit 5-3,8-25 

strip-mining loops 8-5 

structures, padding of 8-13 

subpartitions 2-29,2-34 
creating 2-39 
listing 2-49 

listing the applications in a subpartition 2-54 
removing 2-45 

removing partitions containing 2-45 
succ() function 7-7 

summaries of commands and system calls A-1 

supercomputer 
hardware 1-2 
software 1-4 
using commands on 2-2 

suspend key 2-23 


SVR3.2 B-9 

switches 

compiler 2-8 

compiler optimization 8-3 

-gth switch 8-18 

-host switch B-10 

in nx_initve() 4-6 

-mbf switch 8-16 

-mea switch 8-17 

-mex switch 8-19 

mp_switches 8-18 

-Mquad switch B-4 

-noc switch 8-17 

-node switch B-4 

-nx switch 2-5,2-13,8-18, B-4 

-on switch 2-19 

order of switches 2-13 

-p switch B-4 

-pkt switch 8-16 

-plk switch 8-3,8-15 

-pn switch 2-22 

-pri switch 2-17 

-pt switch 2-18,3-4 

-sct switch 8-18 

-sth switch 8-18 

-sz switch 2-16,3-3 

symlink() system call 5-31 

synchronization of pthreads 6-21 

synchronizing calls 5-9,5-13,5-14 

synchronizing operations 5-29,5-30 
summary 5-48 

synchronous and asynchronous calls 8-10 

synchronous and asynchronous I/O calls 8-24 

synchronous file I/O calls 5-25 

synchronous message-passing calls 3-7,3-8 
with pthreads 6-37 

sys/estat.h file 5-37 

sysacct command 5-4 

syslog command B-6 




system administrator 2-26,2-33,2-56 
system buffers 3-14,8-11 
system calls 

asynchronous file I/O calls 5-27 
asynchronous message-passing calls 3-10 
backward compatibility 4-52, B-9 
closing files in parallel 5-28 
controlling application execution 4-2 
controlling tape devices 5-44 
detecting end-of-file 5-29 
error handling 4-42,8-7 

in parallel file I/O calls 5-26 
extended arithmetic 5-37 
floating-point control 4-46 
flushing Fortran buffered I/O 5-30 
global operations 3-27,7-6 
I/O modes 5-13 
I/O to parallel files 5-24 
increasing the size of a file 5-32 
information about messages 3-15 
iPSC system compatibility 4-52, B-9 
manipulating extended files 5-36 
message buffers 3-14,8-11 
message passing with Fortran commons 3-17 
message-passing 3-1 
miscellaneous 4-50 
moving the file pointer 5-29 
names of message-passing calls 3-7 
opening files in parallel 5-9 
other system calls 4-1 
parallel file I/O 5-8 
parallel file I/O synchronization 5-48 
partition management 4-27 
reading and writing files in parallel 5-24 
repeated 8-2 

summary of C system calls A-4 
summary of Fortran system calls A-28 
synchronization 5-48 
synchronous file I/O calls 5-25 
synchronous message-passing calls 3-8 
timing 4-50 

treating messages as interrupts 3-18 
underscore versions 4-42,8-7 
with pthreads 6-43 


system hardware 1-2 
system message buffers 8-16 
system software 1-4 
System V UNIX B-9 
-sz switch 2-16,3-3 

T 

tail command 5-35 

tape devices, controlling 5-44 

tapemode command B-7 

tar command 5-35 

task decomposition 7-5 

techniques for improving performance 8-1 

techniques for parallel programming 7-2 

telnet command 1-5 

temporarily releasing control of the processor 4-50 

terminal I/O 5-31 

terminology 2-1 

threads 1-1,6-1 

(see also "pthreads") 

thread-safe software 6-1 

tiling 2-37 

timing execution 4-50 
tips for compiling and linking 2-8 
tools for program development 1-6 
topics in this manual v 
topologies 7-6 

Touchstone DELTA System 4-52 
treating a message as an interrupt 3-18 
tree search 7-5 




triangle example 7-18 
truncate() system call 5-31,5-34 
type of a message 3-6,3-15 
type parameter 3-6 
typesel masks 3-6 
typesel parameter 3-6 


U 

UFS file systems 5-2 

underscore versions of system calls 4-42,8-7 
with pthreads 6-43 

understanding message-passing flow control 8-13 

unformatted files 5-11,5-26 

unit stride 8-4 

units (Fortran I/O) 5-8 

UNIX System V B-9 

unlink() system call 5-31 

unlocked_...() system calls 6-10 

unlocked_fseek() system call 5-34 

unlocking pthreads 6-16 

unnamed files 5-13 

unusable nodes 2-31,2-48,4-40 

uppercase .F extension 2-9 

user interface of an application 7-3 

user model 1-5 

differences from iPSC system B-5 

using Paragon OSF/1 commands 
on the Intel supercomputer 2-2 
on workstations 2-2 

using PIDs 4-14 

using the default partition 2-14 

/usr/ccs/lib directory 2-8 
/usr/include directory 2-8 
/usr/lib directory 2-8 
/usr/paragon/XDEV directory 2-9 
/usr/tmp directory 5-13 
utimes() system call 5-31 


V 

variables 

CFS_MOUNT 5-13 
errno 4-42 

with pthreads 6-41 
IPSC_XDEV B-4 
MANPATH 2-6 
NX_DFLT_PART 2-12,2-14 
NX_DFLT_SIZE 2-12,2-17,2-22 
PARAGON_XDEV 2-6,2-9 
PATH 2-6 

vector library 8-6 

vector multiplication 7-11 

vector operations 3-27 

vector size 8-5 

vi command 5-35 

virtual memory 1-5,8-3,8-15 

virtual topology 7-6 

visualization of performance 8-10 

vm_stat command 8-3 


W 

wait() system call, with pthreads 6-9 
waitall() system call B-12 
waitcube command B-6 
waiting for application processes 4-14 




waitone() system call B-13 

wildcards in partition pathnames 2-28 

wiring memory 8-15 

workstations 

architecture of 2-6 
using commands on 2-2 

workstations, working at 1-6 

write (w) permission on a partition 2-33 

write statement 5-24 

write() system call 5-25 
with pthreads 6-4 

writev() system call 5-25 

writing files in parallel 5-24 

X 

x (execute) permission on a partition 2-33 
xprof and xgprof commands 8-2 

Y 

yellow LEDs 1-4 

Z 

zero-length messages 3-6 













